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REMARKS 

I. Status of the Claims 

Claims 2, 3 and 4 are amended. 
Claims 2-4, 6 and 7 are pending. 

Claims 9-15 were restricted out reserving the right to prosecute them in a continuing 
application. 

Applicants' arguments filed on June 13, 2002 overcame all previous rejections. The 
following remarks relate to "newly applied" rejections. 

II. Goal of the Invention 

The examiner appears to have lost sight of a goal of the invention which is to inhibit 
bacterial IMPDH without inhibiting mammalian IMPDH. This is because required inhibitors of 
bacterial IMPDH inhibit or kill bacteria that have infected mammalian hosts, without harming the 
mammalian host. The present disclosure relates for the first time, structural differences between 
bacterial and mammalian IMPDH paving the way toward inhibiting bacteria without inhibiting a 
mammalian host. S. pyogenes was used as a representative sample of bacteria because it is a 
pathogen, and has only one cysteine at the active site rather than 2, simplifying some technical 
steps, and had similar IMPDH amino acid sequences compared to other pathogenic bacteria. A 
model for a binding pocket, based on S. pyogenes will allow finding binding pockets in other 
bacteria without undue experimentation. 

Identification of differences in the catalytic pocket (binding pocket) between bacterial and 
mammalian IMPDH provides focus for designing inhibitors of bacteria IMPDH other than S. 
pyogenes because bacterial species will be more similar to each other than to mammals. Extensive 
comparisons of IMPDH structure among bacteria and between mammals are disclosed. 

On page 8-9, methods are disclosed for determining bacterial IMPDH crystal structure. A 
map shows a "clearly defined electron density for the IMPDH substrate, bound in the catalytic 
site," (page 9, lines 22-24). Based on this invention, a model is now available for bacterial 
IMPDH which provides a framework on which differences among bacteria may be sought. 

"The high-resolution (1 .9A) crystal structure of S. pyogenes IMPDH dehydrogenase allows 
examination of the catalytic site in greater detail than it was possible previously" (pages 14, & 
9-1 1). Differences from the structure of bacterial IMPDH from Chinese hamster of Sintchak are 
described (page 9, lines 19-22). 

III. Claim 3 is Amended as an Independent Claim so is Allowable 



Serial No: 09/533,466 



7- 



Attorney Docket No . 2141 6/90042 



Applicants thank the examiner for finding claim 3 allowable if rewritten in an independent 
form. It has been so amended, therefore applicant requests that it is allowed. 

IV, Claims 2-4, 6 and 7 are Not Vague or Indefinite 

The examiner complains that "bacterial IMPDH" is confusing, yet recommends that very 
phase in the proposed new title (see amendment to the specification), suggesting that "bacterial" is 
a clear modifier. It is clear from the specification (page 2, lines 1-22) to those of skill in the art that 
"bacterial" means IMPDH encoded by a bacterial genome and to an enzyme present in the bacteria 
which when inhibited will kill or inhibit the bacteria, not the mammalian host. Claims are 
interpreted in light of the specification, therefore this phrase remains. 

V. Claims 2-4, 6 and 7 Are Enabled 

Claim 4 was previously allowed. Claim 4 is also enabled. The atomic coordinates of Table 
7 provides a model from which other bacterial binding pockets can be derived. Claim 3 is 
allowable, therefore, claim 2, which depends on claim 3 should also be enabled and allowed. 
Claim 6 and 7 are enabled because the examiner admits Table 7 is enabling. 

The examiner provides no justification for his requirement that to be enabling a binding 

pocket must be demonstrated in other enzymes. This invention relates for the first time, a crystal 

structure of a bacterial IMPDH consequently a method for developing lead compounds. 

In order to define a binding pocket, the atomic coordinates of the 
amino acid atoms surrounding such a pocket must be defined as well 
as analyzed regarding ligand binding in order to determine which 
amino acids interact sufficiently with a typical ligand or structurally 
support or define said pocket so as to be binding pocket amino acids. 
This multilevel analysis is complex as it is three dimensional as 
well as requiring typical ligand shape information also. Such 
complex analysis is deemed undue experimentation without at least 
having a 3-dimensional structure as a guide. 

Contrary to the examiner's assertion, the invention provides a general model for a 
3-dimensional structure. Those of skill in the art know that a well diffracting crystal is an essential 
element in providing x-ray diffraction patterns. The present invention provides the well diffracted 
crystals and discloses methods for x-ray diffraction patterns to practice the invention. These 
crystals are not limited to binding pockets as defined in claims 6 and 7. Claims 6 and 7 are an 
embodiment. S. pyogenes -IMPDH is a representative bacterial enzyme. 

The specification clearly provides details on screening procedures used to obtain well 
diffracting crystals. Making the crystal allowed determinable of the atomic coordinates, the 
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binding pocket and other structural features. Because the generation of crystal is stochastic 
process, the methods outlined in the specification represent the accepted approach for the 
generation of well diffracting crystals. The examiner provides no support for any "undue 
experimentation" requirement and as the Court in Wands stated, "routine experimentation" does 
not mean it is "undue". 

The test [for undue experimentation] is not merely quantitative, since a considerable 
amount of experimentation is permissible, if it is merely routine, or if the specification in question 
provides a reasonable amount of guidance with respect to the direction in which the 
experimentation should proceed to enable the determination of how to practice a desired 
embodiment of the claimed invention. Johns Hopkins Univ. v. Cellpro, Inc., 152 F.3d 1342, 1360 
(Fed. Cir. \99$){c\tmg PPG Indus., Inc. v. Gardian Indus. Corp., 75 F.3d 1 158, 1564, 37 USPQ2d 
(BNA) 1618, 1623 (Fed. Cir. 1996)). 

The present invention is enabling according to Wands. 

Breadth of Claims: S. pyogenes is a model for bacteria based on homology of IMPDH 
sequences. 

Nature of the Invention: Developing lead compounds using S. pyogenes binding pocket 
as a model will be applicable to methods for finding inhibitors for other bacterial IMPDHs. 

State of the Prior Art: This is the first disclosure of structural characteristics of a bacterial 
IMPDH. No other model is available that would lead to binding pockets and inhibitors selective 
of bacteria. 

Level of Ordinary Skill: Publications cited in the disclosure and the exhibits in 
Responses illustrated the level of skill. 

Predictability: Now that a model is disclosed, because of sequence homology and 
predictable 3-D structures, finding binding pockets to guide development of lead compound 
requires only routine experimentation. 

Direction Provided: The inventor disclosed how to make an IMPDH crystal, to determine 
an atomic map, and develop lead compounds based on the binding pocket. 

Working Examples: S. pyogenes. 

Quantity of Experiment: The examiner has not demonstrated that more than routine 
experimentation is required. 

The technique of molecular replacement is well known to those skilled in the art and can 
be used obtaining initial phasing for an unknown structure from a known, structurally related 
molecule (J.P. Turkenburg and E.J. Dodson Modern developments in molecular replacement. 
Curr. Opin. Struct. Biol. 1996 Oct;6(5):604-10). When there is a certain level of sequence 
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homology, and the coordinates of the binding pocket of the species is known, 3-D structure is 
expected to be similar, and an inhibitor that works in the first species is a lead candidate for a 
inhibitor in a related species. Therefore, Dr. Collart is entitled to claim scope for developing lead 
compounds in bacteria, not just in S. pyogenes. 

The examiner has not explained why the previous arguments supported by exhibits A and 
B were not persuasive that claim 4 is enabled. Therefore, in addition to new arguments, the 
applicant repeats that argument and re-submits the exhibits herein for consideration. 

A variety of molecular docking programs are available (two are provided herein as Exhibits 
A and B) to enable those skilled in the art to screen compounds to identify potential inhibitors 
when provided by the information in the present invention. These programs are a common 
component of most structural biology software (e.g. Insight II from Molecular Simulations Inc.). 
Most of the newer programs provide for automated docking of ligands to receptors in the 
structure-based drug design process. 

A variety of purine analogues and mammalian inhibitors have been reported in the 
literature and would be likely candidates to begin a search for specific inhibitors of bacterial 
IMPDH enzymes. However, universities (e.g. University of Georgia) and commercial entities 
(Vertex Pharmaceuticals) have, and are continuing to construct, small molecule digital libraries. 
The present invention provides the structural coordinates of bacterial IMPDH that enables the 
screening of these candidate compounds/digital libraries also. 

V. The Invention of Claim 4 is to a method, not a composition. 

Claim 4 was previously allowed. Now, claim 4 was also rejected as lacking written 
description. However, the specification clearly supports possession of the claimed method. All 
the steps have been done. To maintain this rejection the examiner should explain why he doubts 
Dr. Collart has possession of the claimed method. 

VI. Sintchak Does Not Teach IMPDH from a Bacteria 

In fact, Sintchak does not even teach bacterial IMPDH. Sintchak teaches mammalian 
IMPDH from which a distinguishing model is sought in bacteria. 

A comparison of S. pyogenes enzymes IMPDH show only a 35% identity with the Chinese 
hamster IMPDH of Sintchak. This level of homology did not enable interpretation of the S. 
pyogenes model (specification page 19, lines 24-28). A similar case was observed for the structure 
of IMPDH from the protozoan (not a bacteria as the examiner mistakenly states on page 14 of the 
Action) Tritrichomonas foetus that shows a similar level of identity to the S. pyogenes and Chinese 
hamster enzymes. If molecular replacement cannot be used for determination of the structure, the 
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information used by Sintchak cannot be used to delineate the structure of the IMPDH enzymes 
obtained from bacteria. Those of skill in the art know that molecular replacement is generally only 
applicable for homologues with identity >50%. Homologues with lower identity can not reliably 
be used to determine structure of homologues used for building of a model as is described in the 
specification. 

VII. Wilson Does Not Teach a Method for Developing a Lead Compound 

In fact, Wilson relates IMPDH Type II. Mammals have two forms of IMPDH (Type I and 
II). Bacteria have only a single IMPDH enzyme. There is only 35% identity between the bacterial 
and both human enzymes making it impossible to use the crystal model of Wilson to solve the 
bacterial IMPDH molecule structures. 

VIII. A Prima Facie Case of Obviousness is Not Established 

To properly combine two references to reach a conclusion of obviousness, there must be 
some teaching, suggestion or inference in either or both of the references, or knowledge generally 
available to one skilled in the art, which would have led one to combine the relevant teachings of 
the two references. Ashland Oil, Inc. v. Delta Resins and Refractories, Inc. et al (CAFC 1985) 
776 F. 2d 281, 227 USPQ 657; Ex parte Levengood, supra. Both the suggestion to make the 
claimed composition or device or carry out the claimed process and the reasonable expectation of 
success must be founded in the prior art, not in applicant's disclosure. In re Vaeck (CAFC 1991) 
947 F. 2d 488, 20 PQ. 2d 1438. Citing references which merely indicate that isolated elements 
and/or features recited in the claims are known is not a sufficient basis for concluding that the 
combination of claimed elements would have been obvious, Ex parte Hiyamizu (BPAI 1988) 10 
PQ. 2d 1393, absent evidence of a motivating force which would impel persons skilled in the art to 
do what applicant has done. Ex parte Levengood (BPAI 1993) 28 PQ. 2d 1300. The references, 
viewed by themselves and not in retrospect, must suggest doing what applicant has done. In re 
Shaffer (CCPA 1956) 229 F. 2d 476, 108 USPQ 326; In re Skoll (CCPA 1975) 523 F. 2d 1392, 187 
USPQ 481 . The Examiner has not demonstrated teachings to combine Wilson and Whitby. 
Moreover, even were the two publications reasonably combined, they would not yield the present 
invention. 

Whitby and Wilson do not support on obviousness rejection. Putting the teachings of the 
2 together would get IMPDH type II in a protozoon, not IMPDH Type I in a bacteria. 

An inhibitor for a eurkaryote such as would result from Whitby, would inhibit a 
mammalian host, so if a bacterial binding pocket were not known, an inhibitor developed from 
Whitby, therefore would not fit the goal of the invention. Dr. Collart's bacterial model was 
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needed. Bacteria with sequences similar to streptococcus, and using molecular replacement 
models, would result in lead inhibitors based on bacterial binding pockets - pockets that Dr. Collart 
found differed from mammalian ones. 

IX, Other Issues 

A new title is provided in an amendment to the specification. Applicant notes that the 
examiner uses "bacterial IMPDH" therefore it is unclear where this term is said to be "confusing". 
Corrected drawings are enclosed. 

Contrary to the examiner's comment, on the question marks do appear FIG. 2a. There are 
2 singular ones. If the examiner wishes, we can amend page 4, lines 30-3 1 from "????" to "?" 

X. Summary and Conclusion 

For the reasons stated above, applicant requests allowance of all pending claim 3, 
previously said to be allowable if written in independent form claim 4, which was allowed 
previously, claim 2 that depends on claim 3, and claims 6 and 7 only the bad the numbering of 
amino acid positions confused. This confusion should now be straightened out. 

SEQ ID NO. 23 is based on Table 7 and does not have the same amino acid positions as in 
the specification and claims or in the IMPDH in the PDB. As stated on Table 7, not all amino acid 
positions are shown. The position numbers in the rest of the specification and claims, refer to the 
positions the full IMPDH amino acid sequence as in the Protein Data Bank (Exhibit C) (page 22 
of the specification). 

Please contact applicants' representative if you have any questions. 

No other fees are believed due at this time, however, please charge any deficiencies or 
credit any overpayments to deposit account number 12-0913 with reference to our attorney docket 
number (21416-90042). 
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ABSTRACT: The DOCK program explores possible orientations of a molecule 
within a macr omolecu 1 ar active site by superimposing atoms onto precompiled 
site points. Here we compare a number of different search methods, including 
an exhaustive matching algorithm based on a single docking graph.' We evaluate 
the performance of each method by screening a small database of molecules to a 
variety of rnacromolecular targets. By varying the amount of sampling, we can 
monitor the rime convergence of scores and rankings- We not only show that the 
site point— directed search is tenfold faster than a random search, hut that the 
single graph matching algorithm boosts the speed of database screening up to 
60-fold. The new algorithm, in fact, outperforms the bipartite graph matching 
algorithm currently used in DOCK! The results indicate that a critical issue for 
rapid database screening is the extent to which a search method "biases run time 
toward the highest-ranking molecules. The single docking graph matching 
algorithm will be incorporated into DOCK version 4.0. © 1997 by John Wiley &l 
Sons, Inc. / Comput Chem 18; 1175-1189, 1997 

Keywords: automated molecular docking; 3D database search; molecular 
recognition; structure-based drug design 



Introduction 

With the advent of high resolution X-ray 
crystallography and NMR, structural 
chemists and "biologists can study biomacromolec- 
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ular interactions in atomic detail. This information, 
combined with computational and visualization 
tools, has helped spawn the held of structure-based 
ligand design. A common step in the design cycle 
is the process of molecular docking, in which pos- 
sible binding geometries of a molecule -with" a 
maccomolecule are studied. 

The docking process can be divided into two 
parts: a search algorithm and a scoring algorithm. 
The search algorithm should sample the degrees of 
freedom of the ligand:macromolecule system suffi- 
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eiently to include the true binding modefe). The 
scoring algorithm should represent the thermody- 
namics of interaction sufficiently to distinguish the 
true "binding modeCs) from all others explored. 

Because of the computationally expensive na- 
ture of the search problem, many different solu- 
tions have been proposed. Docking with molecular 
dynamics and Monte Carlo algorithms has been 
explored, 1 ^ 2 including simulated annealing 3 ~ s and 
MCSS 6 methods. Other docking protocols consider 
molecular flexibility, including rotamer search/ -9 
distance geometry, 10 and genetic algorithm 11-13 
methods. To make the search tractable for process- 
ing a large set of molecules, the molecular compo- 
nents are often treated as rigid objects- With this 
approximation, researchers have used systematic 
searching, 14 pattern recognition, 15 " 17 graph theo- 
retical/ 8 ~ Z1 and other superposition 22 techniques 
to dock molecules. 

The UCSF DOCK program belongs to the group 
of methods employing the rigid body assumption 
and uses graph theoretical techniques. Because of 
its speed, the program is often used to screen a 
large database of molecules, selecting potential 
ligands of a receptor target 23 In this article, the 
term "ligand" is used loosely; it refers to any 
small molecule whose binding is under study. The 
term "receptor" refers to the macromolecule -whose 
binding pocket is being explored. 

The core of the DOCK search algorithm Is the 
s up erimp o sition of ligand atoms onto predefined 
site points 2 '^ 23 that map out the negative image of 
the binding site. A matching process is used to 
determine which ligand atoms and site points are 
to be superimposed? 6 Multiple orientations are 
generated this way, with each receiving a score 
assessing the intermolecular interactions. This score 
is based on the intermolecular terms of a molecu- 
lar mechanics force field. 27 Recently, an optimiza- 
tion procedure has been added that adjusts each 
orientation to improve the intermolecular interac- 
tions. 28 ' 29 

In this work, we critically evaluate several 
matching algorithms for the docking process, in- 
cluding an exhaustive matching algorithm. The 
exhaustive algorithm was presented by Bron and 
Kerbosch as a method to detect cliques in an 
undirected graph, 30 It was later incorporated into a 
docking algorithm by Crippen and coworkers. 10 ' 19 
It has many attractive features, so we chose to 
evaluate it in the context of rigid molecular dock- 
ing, score optimization, and database screening. 
As an exhaustive search, it avoids some of the 
artifacts encountered by the current matching 



method. For example, with a nanexhaustive algo- 
rithm, adjusting parameters to increase the total 
amount of sampling can reduce the amount of 
sampling of certain binding modes. 25 Increased 
sampling, with the new algorithm, will always 
retain binding orientations found with less sam- 
pling; leading to a proper superset of binding 
modes. An exhaustive algorithm also does not 
require the additional parameters corLtxollixig the 
heuristics of the non exhaustive search. 26 

Although the matching algorithms formally treat 
the ligand and receptor as rigid objects, they can 
readily be incorporated into a flexible docking 

scheme* 10 * 21 ' 31 " 33 In future work, we will investi- 
gate how best to divide up a flexible docking 
problem into smaller rigid parts, 

To evaluate the performance of the new match- 
ing algorithm, we propose a new assessment pro- 
tocol based on screening a small database of 
molecules. Because we specifically intend to mini- 
mize any artifacts due to the quality of the scoring 
function in this work, we will not use experimen- 
tal measurements as the standard, hut instead, the 
global rruriimum of our current scoring function.. 
We will also evaluate random, and partially ran-* 
dom, search algorithms as controls with which to 
put the current DOCK performance in perspective. 
These control algorithms let us investigate funda- 
mental issues of orientations! sampling, such as 
the effect of using site points to guide the search. 



Methods 

BIPARTITE DOCKING GRAPH 

Since the first release of DOCK, the search pro- 
cess has been driven by a matching procedure in 
which, subsets of ligand atoms and receptor site 
points are identified that have equivalent internal 
distances. 15 Matching is formulated as a graph, 
theoretical problem in which the ligand atoms and 
receptor site points are separate sets of nodes in a 
bipartite graph. 26 A match is defined as a set of 
compatible edges which connect a subset of ligand 
nodes with an equal number of receptor nodes. For 
the edges to be compatible, the distances among 
ligand nodes must map to equivalent distances 
among receptor nodes. An example of match for- 
mation is depicted in Figure 1. As this figure 
illustrates, to extend a match, all possible edges 
(including bad edges) must be considered; dis- 
tance comparisons are used to identify and discard 
bad edges. Matches are extended until there is a 
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sufficient number of nodes in the match to define a 
unique orientation of the Iigand. 

Since version 2.0, DOCK avoided considering 
some bad edges with a pruning method involving • 
distance binning. 26 Nodes were preorgantzed in 
distance bins, such that, for each seed node, sets of 
nodes in discrete distance intervals from the seed 
were identified. These bins guide match extension 
from a seed edge (connecting seed nodes), ensur- 
ing that candidate edges are compatible with the 
seed edge. They do nofc> however, ensure compati- 
bility with other nonseed edges already included 
in the match. For example, the binning algorithm 
would avoid considering the bad edge in step H of 
Figure 1, but not the bad edge in step HL The 
storage requirements for this algorithm grow as 
N n N b N n/b <> N*, where N n Is the number of nodes 
Oigand atoms or receptor site points). N 6 is the 
number of distance bins and grows with the longest 
distance and the inverse of the bin width. N n/b is 
the number of nodes in each distance bin which 
grows with N n and the bin width. 

SINGLE DOCKING GRAPH 

Kuhl et a], proposed merging the bipartite dock- 
ing graphs into a single docking graph, 19 which, is 
then amenable to clique detection techniques de- 
veloped by Br on and Kerbosch. 30 In a single dock- 
ing graph, each node represents a pairing of an 
atom with a site point Each edge identifies adja- 
cent nodes, or two nodes for which both atom 
components and site point components are sepa- 
rated by equivalent distances. The docking graph 
is represented by an adjacency matrix in which 
each nonzero element identifies adjacent nodes. 
An adjacency matrix for the example depicted in 
Figure 1 is presented in Figure 2. The chief advan- 
tage with this representation is that all necessary 
distance comparisons are made during the con- 
struction of the adjacency matrix. Consequently, 
during matching, the adjacency matrix is used as a 
rapid filter to ensure that no bad edges are ever 
considered. This type of matching is presented in 
Figure 3. Although the single docking graph is one 
step removed from the intuitive appeal of the 
bipartite docking graph, it enables a more efficient 
solution to the docking problem. 

The single docking graph representation has 
also been implemented in the FLOG docking pro- 
gram. 21 This program heavily prunes the matching 
search tree using a minimum-residual search 
heuristic. Although it examines all possible nodes 
* at each branch point, it only pursues, the node with 
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FIGURE 1- Bipartite graph matching algorithm. The 
receptor arte points Ca-0 and iigand atoms 0-4) are 
separate sets of nodes in a bipartite graph. 28 I. A first 
(saed) edge is considered. Of the 20 seed edges to be 
tried (5 points* 4 atoms), we first consider A4 for this 
example, In three dimensions, such a match would 
superimpose atom 4 onto site point A This match would 
fbc three of six orientational degrees of freedom. IL 
Second edges are considered- Of the 12 edgles to be 
tried (4 points left*3 atoms left}, we consider E3, 
Because AE > 43, we discard E3 as a second edge. 
Then we consider E1 - Because AE = 41 , we retain E1 as 
a second edge. In three dimensions, this match would 
superimpose atoms 4 and 1 onto points A and E, 
respectively. This match would fix two more orierrtational 
degrees of freedom. HI. Third edges are considered. Of 
The six edges to be tried (3 points left* 2 atoms left), we 
consider C3. Though AC = 43, EC < 13 so we must 
discard C3 as a third edge. Then we consider B3. 
Because AB = 43 and EB - 13, we retain B3 as a third 
edge. .This match fixes the last of six orientational degrees 
• of freedom. IV. The match is large enough to define a 
unique orientation which superimposes atoms 4, 1 , and 
3 onto site points A, E, and B, respectively. 



the smallest difference in Iigand and receptor dis- 
tances -with respect to the most recently added 
node in the match. Backtracking is only allowed at 
the seed level, where all possible nodes are pur- 
sued to initiate matching. As a result, the total 
number of matches can never exceed the number 
of nodes, 

Here we propose implementing the single dock- 
ing graph representation combined with a varia- 
tion of the exhaustive clique detection method 
discussed by Bron and Kerbosch. 30 A clique is 
defined 4 as a set of fully adjacent nodes Q.e., a 
completely connected subgraph) which cannot be 
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FIGURE 2, Adjacency matrix for single docking graph 
algorithm. This matrix identifies all adjacent nodes for the 
example given in Figure 1. Each node Is defined as a 
site point-ligand atom pair, for instance, A4* For two 
nodes to be adjacent, the intra^atom distance must be 
equal to the intra-site point distance. For example, matrix 
element (M, E1) is turned on because AE=41. The 
matrix is symmetric. 



further enlarged without adding a nonadjacent 
node. Much attention is given to the intractable 
nature of the maximum clique problem. It is classi- 
fied as NP-complete because the solution time 
grows faster than any polynomial expression of 
the problem size. 19 For the application of molecu- 
lar docking, however/ we axe not hying to find the 
single, largest clique- The process of matching is in 
fact a process of finding completely connected 
subgraphs within an undirected graph: a less re- 
strictive, and therefore more tractable problem than 
finding cliques and maximum cliques. Although 
Bran and Kerbosch actually present two methods 
and recommend a bounding technique for clique 
detection, we find their original, brute-force 
method sufficient for finding fully connected sub- 
graphs in a manner efficient for molecular dock- 
ing. 

MATCHING PARAMETERS 

In our molecular docking implementation, we 
use two parameters to determine node adjacency: 
a distance tolerance and a distance rninimurrL The 
distance tolerance parameter addresses experimen- 
tal uncertainty in the ligand and target structures. 
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FIGURE 3. Single docking graph matching algorithm. 
This figure depicts the entire search path-starting from - 
the same seed used in Figure 1- Matching begins with 
the match set, M Ql empty and the adjacency set, A Qt 
maximally filled, A match is extended by finding the 
union of the previous match> and a branch node, 

b n , which Is selected from A„_ v The new adjacency 
set A n > is the intersection of A /T _ 1 with the set of all 
nodes adjacent to b n , which is A h and is taken from the 
adjacency matrix presented in Figure 2- Match extension 
continues until A n is empty. Matches with three or more 
nodes define a unique ligand orientation (see text and 
Fig. 4). The match set, {A4 B3 El}, corresponds to the 
solution presented in Figure t. 



The distance minimum parameter enables the 
search to focus on the longer, more relevant inter- 
nal distances. If adjacency ixvf carnation is stored in 
a matrix, then, for most docking situations, the 
matrix can be very large. The matrix size grows as 
(Nf /jf N rcc ) 2 , where N Ug is the number of nonhy- 
drogen ligand atoms and N rcc is the number of 
receptor site points. Because these matrices are 
sparse (ca. 1% elements typically occupied), we 
store only the nonzero elements of each row of the 
matrix as ah integer list of nodes. The probability, 
p onJ of an element being nonzero is a function of 
the distance tolerance and distance iruniimim. The 
memory requirement for the adjacency lists is 
Pan(N Iig N rc1 ) 2 . We presort each adjacency list so 
that the process of finding the common elements 
of two lists (the A n - A n _ 1 O A b steps in Fig. 3) 
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can be performed cm. a once-through basis at a 
speed comparable to the use of the complete ma- 
trix. 

Hue advantage of an exhaustive algorithm is 
that, when sampling is increased, the search is 
guaranteed to include the search space explored at 
the lower sampling level This property helps to 
avoid sampling artifacts encountered with, the bi- 
partite xriatchirtg algorithm, 25 Despite its exhaus- 
tive nature, it does not undergo a combinatorial 
explosion for larger systems because of user con- 
trol over the sampling parameters. Typical sam- 
pling parameters for a docking scenario having 
fewer than 3 0 ligand nonhydrogen atoms and fewer 
than 50 target site points are: four nodes minimum 
for a match, and 0.5-A distance tolerance and 2.0-JL 
distance minimum for node adjacency. Because the 
search never explores invalid branches, search time 
grows as a function of the number of distance-con- 
strained solutions rather than the number of possi- 
ble unconstrained solutions. Therefore, docking 
larger molecules into larger sites can be made 
nearly as rapid if a smaller distance tolerance or 
larger distance minimum is chosen Because mem- 
ory reserved for the adjacency lists is dynamically 
allocated, the memory burden is adjusted as well. 

MINIMUM MATCH SIZE 

Some confusion exists in the literature over how 
many atoms and site points must be in a match to 
define a unique orientation. The orientation is gen- 
erated by a ligand transformation which has a 
translation and a rotation component. The transla- 
tion vector has three degrees of freedom- The rota- 
tion matrix has four degrees of freedom. Three of 
them are represented by the Eular angles. The 
fourth is represented by the sign of the determi- 
nant. A rotation with a positive determinant re- 
tains the handedness of the object it transforms, 
while one with a negative determinant will reverse 
the handedness of the object. If one knows in 
advance whether to reverse the handedness of the 
object, then only the three Euler angles need to be 
determined, For example, when docking a crural 
ligand in which only one stereoisomer is relevant 
(e.g., protein or peptide ligandsX only the poslttve- 
deterrriinant rotation matrix would be of interest 
When docking a ligand available as a racemate, 
then both tr ans formations would be of interest 
The FLOG program, 21 for instance, routinely sam- 
ples both mirror images a ligand, even when the 
ligand is achiral. When the sign of the determinant 
is known in advance, the six degrees of freedom of 
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FIGURE 4, Relative chlraiity of match points. (A) 
Distance matching might identify a set of atoms {12 3 
4} to match with a set of site points {ABC D}, such 
that A is with 1, B is with 2, and so on- Although the 
interna] distances within the atoms are equivalent to 
those within the site points, the handedness is opposite, 
(B) We define the relative chlraiity, O, according to the 
sign of the triple product For any given four-point match, 
the probability that the relative chiralrties are the same is 
50%. (C) If the chirally opposed sets are superimposed 
without inverting the chlraiity of the ligand set, then the 
resuftlng least squares fit is poor. 



the rotation and translation are uniquely deter- 
mined by a match set containing three rLQinlinear 
atoms and site points. 

Processing a larger match set causes only one 
transformation to be allowed. As illustrated in 
Figure 4, when a oxie- to-one mapping has been 
made between two sets of four nonplanar points, 
then each set can be assigned a relative chlraiity. 
This is true even if the points come horn, an achiral 
molecule or from a set of site points where chiral- 
ity is ambiguous. If the relative chiralities are the 
same, then the ligand can be oriented normally. If 
the relative chiralities are opposite, then either the 
ligand is inverted when oriented, or if that is not 
desired, then the match is discarded. In fact, all 
larger matches that are supersets of the discarded 
match are also discarded, because they too yield 
inconsistent matches. If these steps are not taken, 
then the resulting orientations will poorly super- 
impose the ligand atoms and receptor site points 
in the match set, even though all the distance 
tolerances are met (Jig- 4C). 

ADDITIONAL SEARCH CONSTRAINTS 

The systematic design of the matching algo- 
rithm makes it well suited to incorporate special- 
ized search constraints. Some exarriples, although 
not assessed in this study, are mentioned because 



JOURNAL OF COMPUTATIONAL CHEMISTRY 



1179 



EWING AND KUNTZ 



■they have been shown to be useful elsewhere. To 
avoid over samp ling particular binding modes, ori- 
entational degeneracy checking has been stud- 
ied. 17 ' 29 In the new matching algorithm/ a. degener- 
ate orientation is detected as a degenerate match 
whose nodes axe a subset of nodes in a larger 
match. In other words, only subgraphs that are 
true cliques need be processed. As another exam- 
ple, chemical information can be used to guide the 
matching process using labeled atoms and site 
points. Only nodes composed of a chemically com- 
patible atom and site point are used to seed or 
extend a match. Much like the repellent node im- 
plementation of Kuril et al., 19 matches adjacent to 
chemically incompatible nodes are discarded. In 
addition, sampling can be focused on particular 
regions of the active site by defining critical site 
point clusters. This technique is similar to the 
approach used in targeted DOCK 51 and FLOG, 21 
except that clusters can be of arbitrary size and 
number. The matching process automatically re- 
stricts itself to make sure all matches include 
members from each duster- The new matching 



algorithm lends itself so well to these constraints 
that when activated, they contribute a negligible 
computational overhead, and can lead to consider- 
able speed improvements for database search- 
es. 21 ' 3 * 

CONTROL METHODS 

We will test a total of five methods (Table I) to 
isolate specific aspects of the search process. These 
methods range in complexity from a completely 
random search to the bipartite and single graph 
procedures described previously* We begin with 
the uniform random transformation (URT) method, 
which explores a predefined rectangular volume 
enclosing the active site. It is the most simple and 
"hypo thesis-free" of the methods tested here. URT 
will indicate the minimum level of performance 
that we expect from any docking algorithm. The 
uniform random matching (URM) method ex- 
plores the irregularly shaped volume described by 
the collection of site points. It will show the perfor- 
mance gains, if any, of using a "negative image" 



TABLE S. 

Search Methods. 



Abbrev, Method Description Hypothesis Tested 



URT 



URM 



BRM 



BGM 



Uniform 
random 
transformation 



9 



Uniform 
random 
matching 



Biased 

random 

matching 

Single 
graph 
matching 



Bipartite 

graph 

matching 



Construct rectangular volume enclosing 
srta 

Randomly move molecule center of 
mass within volume 
Randomly rotate 

Each molecule In database sampled 
uniformly 

Match random subsets of atoms with 
random subsets of site points 
Superimpose match atoms onto match 
site points with a least squares fit 
Each molecule in database sampled 
uniform iy 

Match randomly (Ilka URM) 

SGM controls amount of sampling for 

each molecule in database 

Using single graph, exhaustively 
match subsets of atoms with subsets 
site points with equfvalent internal 
distances (DOCK 4.0) 

Using bipartite graph, nonexhausitfvely 
match using binning algorithm 
(DOCK 3.5) 



Random method used as reference 
Rectangular enclosure is sufficient 
Srte point description Is unnecessary 



An irregular volume to describe the 
site Is more efficient 



An irregular volume is more efficient 
Spending more time on molecules 
which match better Is more efficient 

Fitting some atoms precisely onto 
of some site points is more efficient 



Fitting some atoms precisely onto 
some site points Is mpre efficient 
Binning algorithm more efficient 
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approach to map out the binding site. The biased 
random matching Q3RM) method is identical to 
UjRM, except it uses the new matching algorithm 
to determine the number of random matches to try 
for each molecule. Once the number of matches 
has been determined, BRM uses completely ran- 
dom selections of nodes to form the actual matches 
used to generate molecule orientations. Because 
BRM is a hybrid approach, it is meant to help 
isolate the source of any differences between TJRM 
and the new matching algorithm. Trie single graph 
matching (SGM) algorithm uses the new matching 
algorithm to determine both the number of matches 
and the actual orientations to try for each molecule. 
SGM will reveal the performance gains, if any, of 
uaing site points to not only map out the most 
interesting binding site volume, but to also direct 
the positioriing of individual ligand atoms. The 
bipartite graph matching (BGM) method is the 
existing DOCK 3.5 matching algorithm- It will re- 
veal the advantage^ if any, of using a nonexhaus- 
rive search method with a longest distance first 
heuristic. BGM is described last because, within 
the spectrum of different search methods, its algo- 
rithm is the most elaborate. 

TEST SYSTEM 

We assess the performance of the search meth- 
ods in the following way- We dock a set of 100 
molecules, chosen randomly from the set of un- 
charged, medium-sized, and generally rigid 
molecules in the available chemicals database 
(ACD). 33 In our study, a medium-sized molecule 
is one with 15 to 35 nonhydrogen atoms. A gener- 
ally rigid molecule is one with no single bonds 
except those attaching hydrogen atoms, attaching 
terrninal nonhydrogens G^e, methyl or -hydxoocyl 
groups), or partidpating in ring structures. Mole- 
cules meeting these three criteria compose 40% of 
the ACT). For each molecule in the test set a single 
CONCORD^-gener ated conformation is used. 

We see several advantages to using such a data 
set of molecules to test the search methods. First, 
the docking conditions represent a dose approxi- 
mation to the typical application of DOCK to 
database screening. Not only can we study the 
convergence of score for each molecule, we can 
study the convergence of relative scores, or rank- 
ings, of the set of molecules. Second/ the docking 
conditions allow us to explore a multitude of di- 
verse molecular shapes so that our results are less 
subject to potential artifacts of a particular ligand: 
receptor system. Although some may argue that 



studying a set of known, potent ligands would be 
more relevant, we counter that the databases 
DOCK searches often do not contain potent 
binders, and that DOCK frequently finds micromo- 
lar inhibitors to serve as lead compounds. 23 By 
choosing a random subset of molecules, we, in 
fact, will arrive at a set that best represents the 
typical array of molecules tested. By biasing the 
subset to include medium-sized, generally rigid 
molecules, we also focus on that portion of the 
database which is best treated by rigid molecular 
dockings 

For each search method, we perform multiple 
docking runs, and vary the amount of sampling 
from zero to a value at which the docking results 
converge. The key sampling parameters for each 
method are listed in Table II. Scoring and opti- 

TABLH 1L 

Sampling Parameters, 



URT 

Total orientations 

UHM 
Total orientations 
Nodes m in /max 

BRM 

Distance tolerance 
Distance minimum 
Nodes min/max 

SGM 
Distance tolerance 

Distance minimum 
Nodes min/max 
BGM 

Distance tolerance 

Ligand bin w[dth 

Receptor bin width 

Ugand bin overlap 

Receptor bin overlap 
Nodes min/max 



0-900.000" 

0-1 00,000 n . 
3/3 b 

0-0.9 A a 
3/3 b 

0-0.6 A a 

2.0 A c 
4/1 0 d 

1 .5 A e 

0.1-0,9A a ' f 

0.1-0.9 A'-' 

0.1-0.5 A B,f 

0-1-0.5 A**' 
4/4° 



Q Pqt some test systems, the up par limit was not reached if 
docking results converged eariy. 

b Set to three, because, as tfie size of a random match 
increases, the least- squares superposition procedure in- 
creasingly biases the orientation -toward the centrcsd of the 
stte points. 

c set large enough to exclude atoms sharing a covaierrt 
bond. 

d Minimum of four chosen so that chirality could be uaed to 
fitter matches- Maximum of ten is somewhat inconsistent 
with value chosen for BGM, but we presume any affects of 
this would be small. 
• Chosen as historical default 

1 Minimum value not zero because of a numerical Instability 
of the algorithm. 
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nuzation parameters are listed in Table IH. The 
tange of values in these tables correspond, to tim- 
ings of less than 0.1 second /molecule to more than 
100 seconds/molecule on a modern workstation 
Csee subsequent text). 

For each docking run, several properties are 
computed that compare the results from any spe- 
cific run to the best results from all runs combined 
(assumed to contain the global rnirumum)* These 
properties are summarized in Table IV. When these 
values are plotted versus time, the convergence of 
each property can be monitored. We assume that a 
better search algorithm will lead to more rapid 
convergence. 

When considering the average behavior of each 
property, we compute both the usual mean and 
also a rank-weighted mean. The rank-weighted 
mean is more sensitive to the behavior of the top 
scoring molecules, which are of most interest in 
database s opening runs. Although many kinds of 
weighting functions could be chosen for this pur- 



TABLE III. 

Scoring and Optimization Para meters. 



Type 



Force field 



Bump maximum 
Dielectric 

Grid spacing 
Interpolation 

Convergence criteria 
Maximum iterations 



3 a 

0.3 A 
Trilinear 

0.1 kcal/mol b 
500 c 



8 Nonzero maximum allows some orienhalioris with limited 
Van dar Waais dash as with the receptor to be recovered by 
the rninimizer. 

A relatively light convergence criteria was selected to re- 
duce noise Jn the score evaluation, bo tnat differences be- 
tween methods were more directly attributable to differences 
In sampling, The rank correlation would be especially vulner- 
able to such noise. 

C A large Iteration llmrt was also selected to reduce noise Jn 
the score evaluation by preventing the mini mizer from termk 
nating prematurory. 



TABLE IV. 

Comparison Methods. 



Method 



Average 

relative 

score 



Definition 



For each molecule in the 
docking run, normalize 
its score by the best 
score It ever received 
In trie site 
Then compute the 
average over the 
molecules in the set 



Equation 



H 



/- 1 

N 



Rank weighted 
equation 



E 



* 1 



Range 



[0, 1]; unitlass 



Rank 

Correlation 



Assign a rank, y h to 
each molecule based 
on fts best score in 
the docking run 
Then correlate y, with 
the rank of each molecule 
based on the best score 
It ever received in 
the site, x y 



E U, - « (y, - ?) 

LU,-x) 2 



N 



L T U,-x)(y,-F) 



[-1, 1]; unitless 



N \ 



Average 
RMS . 
error 



For each molecule in the 
docking run, compute 
the RMS error ot its 
predicted orientation 
compared to that which 
received the best score 
for that site 
Then, compute the 
average over the 
molecules in the set 



N 

N 



T 

£ 1 

E y 



[0, «]; in angstroms 
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TABLE V. 

Receptor Structure from P PB 37 3I> U&ed fof Test Syotema. 

Code 



121D 



1 UUB 



3DFR 



4FAB 



9HVP 



Structure 



DNA dodecamer 
with Netrospin 39 

Purine nucleoside 
phosphorylase 
with guanine 40 

Dihydrofolata 
reductase with 
MADPH and 
methotrexate 41 

Fab fragment with 
fluorescein* 2 . 

H1V-1 protease 
wfth A-74704 43 



Resolution 



2.2 A 



2.75 A 



1.7 A 



21 A 



2.B A 



R factor 



0.198 



0.204 



0.152 



0.215 



0.182 



Site description 



Site Is broad, presenting two continuous 
binding sites in the major and minor grooves 
of the DNA dodecamer; highly polar 

Site has two pockets; one Is broad and centrally 
located, another where actuaJ ligand binds 
Is peripheral and solvent excluded 
Site has a deep, centrally located binding 
pocket; mixed polar and nonpolar regions 



Site is shallow with three pockets formed by 
the six hypervariable loops: generally 
nonpolar 

Srte Is a long, narrow tube, which completely 
penetrates protein; mixed polar and 
nonpolar regions 



pose, we chose to use the reciprocal of the rank for 
convenience. 

To make sure that our conclusions are general- 
izable, we analyzed the methods using five differ- 
ent receptor sites listed in Table V. These sites 
were chosen from the list of complexes of high 
resolution, well-refined structure with a ligand 
having a well-defined binding position. They were 
also chosen based on very different-shaped bind- 
ing sites- The chief features of each site are shown 
in Table V. 



Results and Discussion 

DOCKING CONDITIONS 

The test cases were prepared for docking in the 
standard way. Site points were constructed using 
the sphere generation accessary program of DOCK 
with default parameters, 13 We selected the cluster 
of site points which occupied the binding site of 
the actual ligand in the crystal complex. Within 
this cluster, we merged the positions of tightly 
grouped site points using a 2-A cutoff. The final 
- number of site points used for each receptor ranged 
from 30 to 60. 

RESOURCE USAGE 

All docking calculations were performed on Sil- 
icon Graphics Indigo 2 workstations, equipped with 



200 MHz R4400 processors and 128 MB of RAM, 
so tunings are consistent among the different 
methods. Several weeks of computer time were" 
required to complete all runs. AH methods re- 
quired approximately 13 MB of RAM to store the 
scoring grids. The URT and TJRM methods re- 
quired negligible additional memory for matching 
and orienting, BRM and SGM required up to 0.1 
MB of HAM for notching arrays. BGM required 1 
MB of RAM for matching arrays. 

TEST SYSTEMS 

Selected results for the 3DFR test system are 
presented in Figures 5, 6, and 7 to illustrate the 
type of data we collected.. As shown in Figure 5, 
the weighted average score generally converges 
asymptotically to an optimum as sampling in- 
creases. The scores from all matching methods 
converge to within 90% of the optimum in about 
10 seconds /molecule, whereas the URT meth- 
od requires about 100 seconds /molecule. The 
weighted rank correlation in Figure 6 also shows 
convergent behavior, hut with some interesting 
differences. It goes through much -wider fluctua- 
tions, indicating that small changes in score have 
large effects on the rankings of the top scoring 
molecules. It appears to discriminate among the 
different methods, selecting BRM and SGM as 
superior, BGM and TJRM as next best, and URT as 
worst again. In particular, BRM and SGM both 
show a rapid initial rise, indicating that, with very 
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FIGURE 5. Weighted average score for molecules 
docked to 3DFH using SPHGEN site points. Each curve 
represents a search algorithm In Table LEach data point 
Is a weighted average of the score for all molecules in a 
particular run using the equation in Table IV. 



little sampling, these methods come closest to pre- 
dicting the rankings of the top scoring molecules. 
The convergence of weighted. RMSD in Figure 7 
indicates how long it takes the different methods 
to predict reproducibly the same binding mode of 
the top scoring molecules. Although the two top 
performing methods converge in predicted score 
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FIGURE 6. Weighted rank correlation for molecules 
docked to 3DFR using SPHGEN site points. Each data 
point is an average of the rank correlation for ail 
molecules in a particular run using the equations in 
Table IV. 
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FIGURE 7. Weighted average RMSD for molecules 
docked to SDFR using SPHGEN site points. Only the 
two best search algorithms from Table 1 are presented. 
Each data point is an average of the RMSD for all 
molecules in a particular run using the equations in 
Table IV. 



and ranking in about 10 seconds, they require 
ah out 500 seconds "before they consistently predict 
the same binding mode. This result indicates that, 
for these molecules in this site, several good scor- 
ing orientations must exist that are close in score 
but distant in space. We found similar results for 
the other sites as welL 

We found the weighted and unweighted forms 
of the average score and rank correlation to be the 
most relevant in assessing database scheming per- 
formance. This gives us four measurements of five 
methods over five sites- Instead of presenting 100 
different curves, we nave condensed each curve 
into a single value — the convergence time — which, 
represents the time at which the 90% threshold 
value is passed (and not recrossed). In Table VI, 
we present the convergence times along with the 
speed improvement factor of each method com- 
pared to the URT method. 

SCORE CONVERGENCE 

With respect to the unweighted average score in 
Table VI(A), all matching methods show roughly 
equivalent convergence, and outperform URT by a 
factor of 10. Therefore, on average, site points 
provide- a much more succinct description of the 
active site than the smallest enclosing box, espe- 
cially when searching a site that is large or difficult 



1184 



VOL 18, NO. 9 




AUTOMATED MOLECULAR DOCKING AND DATABASE SCREENING 



TABLE VI. 

Convergence Properties of Seorcfi Methods,* 



Srte 



URT 



(A) Unweighted Average Score 
121D 200 
1ULB 200 (1x) 

3DFR 100 (1>0 

4 FAB 30 (1x) 

9HVP 200 (1x) 



Mean 



100 (Ix) 



(B) Hank-weighted average score 
121D 4O0 Ox) 
1ULB '1000 (1X) 
3DFR 200 Ox) 
4FAB 90 Ox) 
9HVP 200 (1x) 

Mean 300 Ox) 

(C) Unweighted rank correlation 
121D 400 Ox) 
1ULB 1000 Ox) 
3DFR 400 (1xJ 
4FAB 200 Ox) 
9HVP 700 (1x) 



Mean 



500 Ox) 



(D) Rank-weighted rank correlation 
121D 400 Ox) 

1ULB 1000 Ox) 

3DFR 600 (1x) 

4FAB 400 Ox) 

9HVP 300 (ix) 



Mean 



500 Ox) 



URM 



10 (20x) 
10 (20x) 
9 (1Qx) 

7(4)0 
7(30x) 

9 OOx) 



10 (40x) 
60 (20x) 
20 (9x) 
7 (10x) 
7 OOx) 

20 (20x) 



20 (20x3 
60 feux) 
50 (9x) 
30 (7x) 
20 (40x) 

30 (10x) 



20 (3Qx) 
60 (2 Ox) 
100 (6x) 
10 (30x) 
7 (40x) 

20(20x) 



Search Methods 



BRM 



20 (10x) 
20 (8x) 

10 Oox) 

8(4x) 
5 (40x) 

10 (I0x) 



e (box) 

30 (40x) 
1 0 (20x) 
4(20x) 
2 (100x) 

7 (40*) 



10 (3 Ox) 
40 (20x) 
20 (20xJ 
10 (20x) 
8 (90x) 

20 (30x) 



10 (40x) 
20 (60x) 
10 (4Cx) 
3 Ooox) 
2 (100x) 

7 (70x) 



SGM 



10 (20x) 
20 (10x) 
10 (10)0 
7 (5x) 
5 (40x) 

9 (10x3 



3 OOGx) 
20 (70x) 

8 (30X) 

4 (20x) 
4 (40x) 

6 (50x) 



7 (6Qx) 

30 (30x) 
10 (30x) 
20 (7x) 
10 (60x) 

20 (30x) 



7 (60x) 
20 (70x) 
10 (40x1 

4 (90x) 
4 (60x) 

8 (60x) 



BGM 



8 (30x) 

8 (20x) 
20 (6x) 

7 (4x) 

6 (30x) 

9 OOx) 

8 (50x) 
30 (40x) 
20 (10x) 

7 (1 Ox) 
5 (40x) 

10 (30x) 



10 (40x) 
80 (1 Ox) 
40 (10x)- 
50 (3x) 
20 (50x) 

30 (20x) 



20 (30x) 
80 (20x) 
20 (30x) 
10 (30x) 
3 (80x) 

20 (30x) 



a For each sampling method and receptor site, we find the amount of sampling, in seconds, beyond which all values are within 90% 
of the maximum. With lbs time of URT as the reference, the relative speed factor of each method is reported In parentheses. The 
geometric mean over all receptor arts values is reported et bottom- Because of the large uncertainty in these values, all values are 
rounded to one significant digit. 



to define a -priori. Because Hie site points were 
generated based cm general considerations of 
shape, this result should be generalizable to other 
"negative image" techniques, like the shape-based 
critical point methods of Nussinov and Wolfs en, 44 
and the energetic probe methods of Goodford. 45 
The rank-v/eighted average score in Table VT(B) 
shows more discrirrunation among the notching 
methods. While the two uniform random methods 
(URT and UEM) had more difficulty converging 
with the top-scoring molecules, BRM and SGM 
actually converged more quickly. This implies that 
trying a irniform number of orientations for the 



molecules in the database is inefficient with re- 
spect to processing the top-scoring molecules, BGM 
performed better than URM, but not as well as 
either bRM or SGM. 

RANK CONVERGENCE 

The convergence of the rank correlation further 
confirms the differences between methods- The 
unweighted rank correlations in Table VT(C) again 
show a general 10- to 30-fold advantage of using 
site points to dock the molecules. Interestingly, the 
time required to get the rank correct is two- to 
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fivefold greater than the time to get the average 
score [Table VlCA)] correct The weighted rank 
correlations in Table VICD) also further discrimi- 
nate among the methods. UEM and BGM still 
show the 20- to 30-fold advantage over URT. BEM 
and SGM again outperform URT by 60- to 70-fold. 
Of all measures, this last one is arguably the most 
relevant to database screening, because absolute 
scores are generally not used as strict cutoffs, but 
instead the rankings are used to select some sub- 
group of molecules whose number is amenable to 
further processing. Most often, it is the top-scoring 
subgroup of most interest, so using a rank- 
weighted correlation should focus our attention on 
how the methods treat this particular subgroup of 
molecules. Therefore, it appears that, of the meth- 
ods investigated here, BEM and SGM are the best 
suited for database screening. 

BGM VERSUS SGM 

The single graph matching method clearly out- 
performs the existing bipartite graph matching 
method by up to twofold in speed- This result may 
appear counterintuitive^ because SGM is exhaus- 
tive, whereas BGM uses heuristics to speed the 
search. However, the precomputing of the adja- 
cency matrix and the rapid processing of the adja- 
cency lists show that reformulating the problem 
into an efficient form can be just as effective as 
using heuristics- SGM has the additional advan- 
tage of requiring fewer fundamental matching pa- 
rameters than "BGM (Table H). 

BRM VERSUS SGM 

Why does a simplistic random matching algo- 
rithm, BRM, perform so competitively with the 
distance matching algorithm, SGM, and even out- 
perform BGM? Does this suggest that distance 
matching is an unnecessarily complicated solution 
to docking? We seek to resolve this question by 
breaking the problem into three parts. 

1« What is the disadvantage of sampling each 
molecule unif ormly? 

2. Why does distance matchirig sample mole- 
cules rumurii f ormly? 

3* Is an curLexitatiorL from random matching just 
as good as one from distance matrhing? 

First, spending the same time on each molecule 
may result in spending too little rime on the better 



scoring molecules. One feature of the force field 
scoring function is that it tends to favor larger 
molecules. 21 For the set of molecules used in this 
study, we have plotted the best score for each 
molecule against its si^e in Figure 8. Although 
there is some trend, the correlation coefficient is 
not large. A stronger trend exists in Figure 9, 
relating molecule size to bump filtering. Large 
molecules have a greater propensity to bump into 
receptor atoms when oriented in the site. Because 
we use a bump filter in DOCK to discard poor 
orientations before the more computationally ex- 
pensive scoring and optimization steps, we are 
more likely to discard an orientation of a large 
molecule than that of a small molecule. Forcing a 
uniform number of matches per molecule would 
then result in a size-biased attrition through the 
bump filter and, overall, spending less time on the 
potentially better scoring, larger molecules. 

Secand, nonuniform sampling arises in distance 
matching because the number of matches is re- 
lated to the number of internal distances that the 
ligand has in common with the site points. Larger 
molecules have more internal distances , and so 
will tend to have more in common. Therefore," 
larger molecules tend to generate more matches 
than smaller molecules. However, a molecule that 
is riot larger, but rather similar in shape to the 
binding site, will also have more internal distances 
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FIGURE 8. How molecule size affects force field score. 
The best force field score for each molecule is plotted 
versus the number of nonhydrogen atoms in the 
molecule. The plots for each receptor are pooled Into 
this single plot to show the overall trend.' Frtting a line to 
these data yields an R 2 value erf 0.24. 
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FIGURE 9- How molecule size affects bump filtering. In 
a given docking run, we compute the fraction of 
orientations that pass the bump filter for each molecule, 
and then plot this value against the size of the molecule. 
(Top) BRM method. Average fitter rate is 0,031. (Bottom) 
SGM method. Average fitter rate is 0,36. 



in coramnru Therefore, distance matching will 
spend more -time on molecules that are comple- 
mentary in shape to the receptor. Thus, the BRM 
method, "benefits from using distance matching to 
determine the number of orientations to try simply 
because it will bias its efforts toward the larger 
and/or more complementary molecules. 

Third, an orientation from random matching is, 
on average, not as good as one from distance 
matching. We can check the quality of these orien- 
tations by examining how they survive the bump 
filter as depicted in Figure 9. The orientations from 
distance rriatching are at least ten times more likely 
to make it past this filter/ so they are indeed 



superior. To understand why this superiority does 
not translate directly into faster docking, we must 
consider the computational bottleneck of the cur- 
rent implementation of DOCK. If the early docking 
steps — rriatching, orienting, and bump filtering- 
were to consume a dominant portion of the total 
cpu time, then SGM would be up to ten rimes 
faster than BRM. On the other hanct if the later 
docking steps — scoring and optimization — were 
to consume a dominant portion of the total cpu 
time, then BRM would be equivalent in speed to 
SGM, With the current implementation of the min- 
imizer, the scoring and optimization steps indeed 
consume an overwhelming portion of cpu time 
(> 90%), so the ability of distance matching to 
form high-quality orientations is for now unre- 
warded. BRM is kept competitive because, m con- 
cert with heavy pruning by the bump filter, it 
forms orientations that are suitable starting points 
for optimization. 

■ BRM VERSOS BGM 

The fact that BRM outperforms BGM by up to a 
factor of two points out the critical importance of 
how processing time is allocated among the diffe*- 
ent molecules in the database. For maximum effi- 
ciency, a search algorithm must "bias its efforts 
toward the most highly ranked, molecules, BGM 
indeed has such a bias built in, but the heuristics 
have the effect of reducing the magnitude of bias 
inherent to the unrestrained matching found in 
SGM and BRM> 

FUTURE DIRECTIONS 

It appears sensible to incorporate the single 
graph matching technique into version 4.0 of 
DCOCy because its results are of high quality and 
its potential for sp^ed improvement is high. The 
speed gains might arise from additional orienta- 
tion filtering or by fundamental improvements in 
the opbmizatiori technique. The new matching al- 
gorithm will provide .a solid algorithmic founda- 
tion on which to base further development of 
molecular docking/ including the addition of so- 
phisticated, search constraints like chemical labels 
and critical clusters, as well as the explicit treat- 
ment of ligand flexibility. 

Until a faster scoring and optirivization method, 
is implemented, it may be useful to p reprocess the 
set of orientations generated by matching- For in- 
stance, an implementation of degeneracy checking 
has been tried in which similar orientations axe 



JOURNAL OF COMPUTATIONAL CHEMISTRY 

r 



1187 



EWING AND KUNTZ 



removed at the matching, stage. 29 Because orient- 
ing is also relatively facile, it implies that degener- 
acy checking could instead be performed as an 
RMS deviation calculation between real orienta- 
tions. Rarey et al. present a rapid SMSD evalua- 
tion technique based directly on the rotation 
matrices 17 that would be applicable. It might also 
be possible identify a unique property of orienfca- 
tions that lead to the best scores upon minimiza- 
tion (e.g., degree of surface overlap with receptor). 
We intend to further study the nature of orienta- 
tions generated by BRM that pass the bump filter, 
to see why they manage to score so well upon 
minimisation. Based upon this knowledge, a filter 
could be constructed that enriches the set of orien- 
tations generated by regular matching prior to 
miruirdmtion. 

An additional avenue for improvement would 
be to try alternative methods to generate site 
points. The current shape-based site points may 
not be entirely consistent with a f orce-field-b as ed 
scoring function. 21 Including force-field considera- 
tions during site point construction, might make 
the site points more relevant as points an which to 
position Jig and atoms, thereby increasing the qual- 
ity of docked orientations. 

We are working to extend the current docking 
protocol to include ligand and receptor flexibility. 
Ligand flexibility can be incorporated in several 
ways. The most straightforward approach is to 
dock multiple conformations of the ligand sepa- 
rately^ 1 A potentially more efficient method is to 
use distance geometry to build a ligand conforma- 
tion that fits a subset of receptor site points. 10 
Another viable option is the "divide and conquer" 
strategy, in which a flexible molecule is broken 
into rigid fragments, the fragments are docked 
independently, and the molecule is rebuilt from 
adjacent fragment orientations. 31 " 33 Limited recep- 
tor flexibility is being investigated during the step 
of score evaluation by superimposing multiple re- 
ceptor conf ormers on a single score potential grid. 46 



Conclusion 

We evaluated various search algorithms for au- 
tomated molecular docking that range in complex- 
ity from purely random to site point-directed, 
graph-theoretical matching methods. Our basis of 
comparison was how quickly each docking method 
could correctly score and rank a database of 
molecules- Over a broad range of active site envi- 



ronments, it is at least tenfold more efficient to use 
a collection of site points to describe the active site 
search volume than to use the smallest enclosing 
box. Using graph theoretical matching techniques 
boosts this relative efficiency higher. The bipartite 
graph matching used in the current DOCK version 
35 improves efficiency up to 30-fold. The single 
graph matching proposed for DOCK version 4.0 
improves efficiency up to 60-fold. Because single 
graph matching is not only faster, but also less 
complicated than the bipartite graph matching 
used in version 3.5, we feel it will be an important 
advance in docking technology. 
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ABSTRACT: A novel and robust automated docking method that predicts the 
bound c onf orrrLations of flexible ligands to macxarriuolecular targets has been 
developed and tested, in combination with anew scoring function that estimates 
the free energy change upon bindings Interestingly, this method applies a 
Lamarckian model of genetics,, in which, environmental adaptations of an 
individual's phenotype axe reverse transcribed into its genotype and become 
heritable traits (sic). We consider three search methods, Monte Carlo simulated 
annealing, a traditional genetic algorithm, and the Lamarckian genetic algorithm, 
and compare their performance in dockings of seven protein— ligand test systems 
having known three-dimensional structure, We show that both the traditional 
and Lamarckian genetic algorithms can handle ligands with more degrees of 
freedom than the simulated annealing method used in earlier versions of 
AutoDock, and that the Lamarckian genetic algorithm is the most efficient, 
reliable, and successful of the three. The empirical free energy fimction was 
calibrated using a set of 30 structurally known protein— ligand complexes with 
experimentally determined binding constants. Linear regression analysis of the 
observed binding constants in terms of a wide variety of structure-derived 
molecular properties was performed. The final model had a residual standard 
error of 9.11 kj mol~ l (2.177 kcal mol" 1 ) and was chosen as the new energy 
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function. The new search methods and empirical free energy function are 
available in AxttoDock, version 3.0. © 1998 John Wiley & Sons, Ihc T Comuu t 
Chem 19: 1639-1662, 1998 F 
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Introduction 




fast atom-based computatkinal docking tool 
is essential to most techniques for structure- 
based drug design. 1,2 Reported techniques for au- 
tomated docking fall into two broad categories: 
matching methods and docking simulation meth- 
ods. 3 Maix±iing methods create a model of the 
active site, typically including sites of hydrogen 
bonding and site that are stericaHy accessible, and 
then attempt to dock a given inhibitor structure 
into the model as a rigid body by matching its 
geometry to that of the active site. The most suc- 
cessful example of this approach is Dock/' 5 which 
is efficient enough to screen entire chemical 
databases rapidly for lead compounds. The second 
class of docking techniques model the docking of a 
ligand to a target in greater detail: the ligand 
begins randomly outside the protein, and explores 
translations/ orientations, and conformations until 
an ideal site is found. These techniques are typi- 
cally slower than the matching techniques, but 
they allow flexibility within the ligand to be mod- 
eled and can utilize more detailed molecular me- 
chanics to calculate the energy of the ligand in the 
context of the putative active site. They allow 
computational chemists to investigate modifica- 
tions of lead molecules suggested by the chemi- 
cal intuitioxi and expertise of organic synthetic 
chemists. 

AUTODOO£ 6 ' 7 is an example of the latter, more 
physically detailed, flexible docking technique. 
Previous releases of AutoDock combine a rapid 
grid-based method for energy evaluation, 8 ' 9 p re- 
calculating ligand— protein pairwise interaction en- 
ergies so that they may be used as a look-up table 
during simulation, with a Monte Carlo simulated 
annealing search 10, 11 for optimal conformations of 
ligands. AutoDock has been applied with great 
success in the prediction of bound conformations 
of enzyme— hnhibitox complexes,, 12 ' 13 peptide— anti- 
body complexes, 14 and even protein— protein inter- 
actions 15 ; these and other apphcations have been 
reviewed elsewhere. 16 



We initiated' the current work to remedy two 
limitations of AutoDocx. (i) We have found that 
the simulated annealing search method performs 
well with ligands that have roughly eight rofatable 
bonds or less; problems with more degrees of 
freedom, rapidly become intractable. This de- 
manded a more efficient search method, (ii) 
AutoDock is often used to obtain unbiased dock- 
ings of flexible inhibitors in enzyme active sites; in 
computer-assisted drug-design, novel modifica- 
tions of such lead molecules can be investigated 
computationally. Like many other computational 
approaches, AutoDock performs well in predict- 
ing relative quantities and rankings for series of 
similar molecules; however, it has not been possi- 
ble to estimate in AutoDock: whether a ligand will, 
bind with a mnlimolar, micromolar, or nanomolar 
binding constant. Earlier versions of AtttoDock 
used a set of traditional molecular mechanics 
force-field parameters that were not directly corre- 
lated with observed binding free energies; hence, 
we needed to develop a force field that could be 
used to predict such quantities. 

Molecular docking is a difficult c^timization 
problem, requiring efficient sampling across the 
entire range of positional, orientational, and con- 
formational possibilities. Genetic algorithms (GA) 
fulfill the role of global search particularly well, 
and are increasingly being applied to problems 
that suffer from combinatorial explosions due to 
their marry degrees of freedom. Both canonical 
genetic algorithms 17 "* 21 and evolutionary program- 
ming methods 21 have been shown to be successful 
in both drug design and docking. 

In this report, we describe two major advances 
that are included in the new release of AuroDooc, 
version 3.0, The first is the addition of three new 
search methods: a genetic algorithm; a local search 
method; and a novel, adaptive global— local search 
method based on Lamarckian genetics, the La- 
roarckian genetic algorithm (LGA). The second ad- 
vance is an empirical binding free energy force 
field that allows the prediction of binding free 
energies, and hence binding constants, for docked 
ligands. 
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Methods 

GENETIC ALGORITHMS 

Genetic algorithms 2 * use ideas based on the lan- 
guage of natural genetics and biological evolu- 
tion, 4 la the case of molecular docking, the partic- 
ular arrangement of a ligand and a protein can be 
defined by a set of values describing the transla- 
tion, orientation, and conformation of the ligand 
with respect to the protein: these are the ligand's 
state variables and, in the GA, each state variable 
corresponds to a gene. The ligand's state corre- 
sponds to the genotype, whereas its atomic coordi- 
nates correspond to the phenotype. In molecular 
docking, the fitness is the total interaction energy 
of the ligand with the protein, and is evaluated 
using the energy function. Random pairs of indi- 
viduals are mated using a process of crossover, in 
which new individuals inherit genes from either 
parent In addition, some offspring undergo ran- 
dom mutation, in which, one gene changes by a 
random amount. Selection of the offspring of the 
current generation occurs based an the individual's 
fitness; thus, solutions better suited to their envi- 
ronment reproduce, whereas poorer suited ones 
die. 

A variety of approaches have been adopted to 
improve the efficiency of the genetic algorithm. 
Classical genetic algorithms represent the genome 
as a fixed-length bit string, and employ binary 
crossover and binary mutation to generate new 
individuals in the population, Unfortunately, in 
many problems,, such, binary operators can gener- 
ate values that are often outside the dornain of 
interest leading to gross ixieftidencies in the search. 
The use of real encodings helps to limit the genetic 
algorithm to reasonable domains. Alternative ge- 
netic algorithms have been reported 25 that employ 
more complicated representations and more so- 
phisticated operators besides crossover and muta- 
tion. Some of these retain the binary represen- 
tation,, but must employ decoders and repair 
algorithms to avoid' building illegal individuals 
from the chromosome, and these are frequently 
computationally intensive. However, the search 
performance of the genetic algorithm can be im- 
proved by introducing a local search method. 2 ^' 27 

HYBRID SEARCH METHODS IN AUTODOCK 

Earlier versions of AuTODooc used optimized 
variants of simulated annealing. 6 ' 7 Simulated an- 
nealing may be viewed as having both global and 



local search aspects, performing a more global 
search early in the run, when higher temperatures 
allow transitions over energy barriers separating 
energetic valleys, and latex on performing a more 
local search when lower temperatures place more 
focus on local crptimization in the current valley. 
AtjtoDock 3.0 retains the functionality of earlier 
versions, but adds the options of using a genetic 
algorithm (GA) for global searching, a local search 
(L5) method to perform energy rniruntization, or a 
combination of both, and builds on the work of 
Belew and ITarL 27 - 28 The local search method is 
based an that of Soils and Wets/ 9 which has the 
advantage that it does not require gradient infor- 
mation about the local energy J and scape, thus fa- 
cilitating torsional space search. In addition, the 
local search method is adaptive, in that it adjusts 
the step size depending upon the recent history of 
energies: a user-defined number of consecutive 
failures, or increases in energy, cause the step size 
to be doubled; conversely, a user-defined number 
of consecutive successes, or decreases in energy, 
cause the step size to be halved. The hybrid of the 
GA method with the adaptive LS method together- 
form the so-called lumnrckian genetic algorithm 
(LGA3, which has enhanced performance relative 
to simulated annealing and GA alone, 21 ' 26 and is 
described in detail later. Thus, the addition of 
these new GA-based docking methods enhances 
AtjtoDock, and allows problems with more de- 
grees of freedom to be tackled. Furthermore, it is 
now possible to use the same force field as is used 
in docking to perform energy nun Imitation of 
ligands. 

IMPIJEMENTATI ON 

In our implementation of the genetic algorithm, 
the chromosome is composed of a s bring of real- 
valued genes: three Cartesian coordinates for the 
ligand translation; four variables . defining a 
quaternion specifying the ligand orientation; and 
one real-value for each ligand torsion, in that or- 
der. Quaternions are used to define the orienta- 
tion 30 of the ligand, to avoid the gimbal lock 
problem experienced with Euler angles. 31 The or- 
der of the genes that encode the torsion angles is 
defined by the torsion tree created by AlttoToks, a 
preparatory program used to select rotatable bonds 
in the ligand. Thus, there is a one-to-one map ping 
from the ligand's state variables to the genes of the 
individual's chromosome. 

The genetic algorithm begins by creating a ran- 
dom population of individuals, where the user 
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defines the number of individuals in the popula- 
tion. For each random individual in the initial 
population, each of the three translation genes for 
x, y, and z is given a uniformly distributed ran- 
dom value between the ininimum and maximum 
x, y, and z extents of the grid maps, respectively; 
the four genes defining the orientation are given a 
random qixatemiorv consisting of a random unit 
vector and a random rotation angle between — 180° 
and H- 180°; and the torsion angle genes, if any, are 
given random values between — 180° and 4- 180°. 
Furthermore, a new random number generator has 
been introduced that is hardware-independent 32 It 
is used in the LS, GA, and "LGA search engines, 
and allows results to be reproduced on any hard- 
ware platform given the same seed values. The 
creation of the random initial population is fol- 
lowed by a loop over generations, repeating until 
the maximum number of generations or the maxi- 
mum number of energy evaluations is reached, 
whichever comes first. A generation consists of 
five stages: mapping and fitness evaluation selec- 
tion, crossover, mutation,, and elitist selection, in 
that order. In the Lamaxckian GA, each generation 
is followed by local search, being performed on a 
userniefiiied proportion of the population. Each of 
these stages is discussed in more detail in what 
follows. 

Ivtepping translates from each individual's geno- 
type to its corresponding phenotype, and occurs 
over the entire population. This allows each indi- 
vidual's fitness to be evaluated. This is the sum of 
the intermoleculax interaction energy between the 
ligand and the protein, and the intramolecular 
interaction energy of the Uganda The physicocherrv- 
ical nature of the energy evaluation function is 
described in detail later. Every time an individual's 
energy is calculated, either during global or local 
search, a count of the total number of energy 
evaluations is incremented. 

This is followed, in our implementation, by pro- 
portional selection to decide which individuals will 
reproduce. Thus, individuals that have better- 
than-avexage fitness receive proportionally more 
offspring, in accordance with: 



n 0 « - 



where n 0 is the integer number of offspring to be 
allocated to the individual; f f is the fitness of the 
individual G.e„ the energy of the ligand); / w is the 
fitness of the worst individual, or highest energy, 



in the last N generations (i.e., N is a user-defina- 
ble parameter, typically 10); and </> is the mean 
fitness of the population. "Because the worst fitness, 
f aJ will always be larger than either or </), 
except when f { «/„/ then for individuals that have 
a fitness lower than the mean, f { < (/), the nu- 
merator in this equation, f w — f {/ will always be 
greater than the denominator f m — and thus 

such, individuals will be allocated at least one. 
offspring, and thus will be able to reproduce. 
AirroDoCK checks for f w = <£/> beforehand, and if 
true, the population is assumed to have con- 
verged, and the docking is terminated. 

Crossover and mutation are performed on ran- 
dom members of the population according to 
user-defined rates of crossover and mutation. First, 
crossover Is performed. Two-point crossover is 
used, with breaks occurring only between genes, 
never within a gene — this prevents erratic changes 
in the real values of the genes. Thus, both parents' 
chromosomes would be broken into three pieces at 
the same gene positions, each piece containing one 
or more genes; for instance, , ABC and abc. The 
chromosomes of the resulting offspring after two-, 
point crossover would be AbC and aBc. These 
offspring replace the parents in the population/ 
keeping the population size constant. Crossover is 
followed by mutation; because the translational, 
orientational, and torsional genes are represented 
by real variables, the classical bit-flip mutation 
would be inappropriate- Instead, mutation is per- 
formed by adding a random real number that has 
a Cauchy distribution to the variable, the distribu- 
tion being given by: 



tt(0 2 + (x - a) 2 ) 

a £l 0, p > 0, -co < x < 00 



where <x and /3 are parameters that affect the 
mean and spread of the distribution- The Cauchy 
distribution has a bias toward smalt deviates, but, 
unlike the Gaussian distribution, it has thick tails 
that enable it to generate large changes occasion- 
ally. 26 

An optional user^lefined integer parameter 
elitism determines how many of the top individu- 
als automatically survive into the next generation. 
If the elitism parameter is non-zero, the new popu- 
lation that has resulted from the proportional se- 
lection, crossover, and mutation is' sorted accord- 
ing to its fitness; the fitness of new individuals 




having resulted from crossover and/or mutation 
is calculated as necessary. Because populations are 
implemented as heaps, selection of the best n 
individuals is efficient. 

The genetic algorithm iterates over generations 
until one of the termination criteria is met At the 
end of each docking, AtjtoDock reports the fitness 
(the docked energy), the state variables, and the 
coordinates of the docked conformation, and also 
the estimated free energy of binding. AutoDock 
performs the us er-sp e rilied number of GA dock- 
ings, and then carries out conformational cluster 
analysis on the docked conformations to determine 
•which are similar, reporting the clusters ranked by 
increasing energy. 

LAMARCKUN GENETIC ALGORITHM 

The vast majority of genetic algorithms mimic 
the major characteristics of Darwinian evolution 
and apply Mendelian genetics. This is illustrated 
on the right-hand side of Figure 1 (note the one- 
way transfer of irrformation from the genotype to 
the phenotype). However, in those cases where an 
inverse mapping function exists (i.e., one which 
yields a genotype from a given phenotype), it is 
possible to finish a local search by replacing the 
individual with the result of the local search; see 
the left-hand side of figure 1. This is railed the 
Lamar ckian genetic algorithm (LGAX and is an 
allusion to Jean Batiste de Lamarck's (discredited) 
assertion that phenotypic characteristics acquired 
during an individual's lifetime can become herita- 
ble traits. 33 

The most important issues arising in hybrids of 
local search (LS) techniques with the GA revolve 
around the developmental mapping, which trans- 
forms genotypic representations into phenotypic 
ones, 26 The genotypic space is defined in terms of 
the genetic operators — mutation, and crossover in 
our experiments — by which parents of one genera- 
tion are perturbed to form their children. The phe- 
notypic space is defined directly by the problem, 
namely, the energy function being, optimized* The 
local search operator is a useful extension of GA 
global optimization when there are local "smooth- 
ness" characteristics (continuity, correlation,, etc) 
of the fitness function that local search can exploit. 
In hybrid GA + LS op tirmzations, the result of the 
LS is always used to update the fitness associated 
with an individual in the GA selection algorithm 
If, and only if, the developmental mapping func- 
tion is invertible, will the Lamarckian option — 
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FiGURE 1. This figure illustrates genotypic and 
phenotypic search, and contrasts Darwinian and 
Lamarckian search. 27 The space of the genotypes is 
represented by the lower horizontal line, and the space 
of the phenotypes is represented by the upper horizontal 
line. Genotypes are mapped to phenotypes by a 
developmental mapping function. The fitness function Is 
f(x). The result of applying the genotypic mutation 
operator to the parent's genotype is shown on the 
right-hand side of the diagram, and has the 
corresponding phenotype shown. Local search is shown 
qn the left-hand side, ft is normally performed in 
phenotypic space and employs information about the 
fitness landscape. Sufficient iterations of the local search 
arrive at a local minimum, and an inverse mapping 
function is used to convert from its phenorype to its 
corresponding genotype. In the case of molecular 
docking, however l local search is performed by 
continuously convening from the genotype to the 
phenotype, so Inverse mapping Is not required. The 
genotype of the parent is replaced by the resulting 
genotype, however, in accordance with Lamarckian 
principles. 



c caw erring the phenotypic result of "LS back into its 
corresponding genotype become possible, 

In our case, the fitness or energy is calculated 
from the ligand's coordinates, which together form 
its phenotype. The genotypic representation of the 
ligand, and its mutation and crossover operators, 
have already been described- The develop mental 
mapping simply transforms a molecule's geno- 
typic state variables into the corresponding set of 
atomic coordinates. A novel feature *of this applica- 
tion of hybrid global-local optimization is that the 
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Solis and Wets LS operator searches through the 
genotypic space ratiier than the more typical pher 
notypic space. This means that the developmental 
mapping does not need to be inverted. Nonethe- 
less, this molecular variation of the genetic algo- 
rithm a till qualifies as Lamarckiaxv because any 
''errvirorunental adaptations" of the ligand ac- 
quired during the local search, will be inherited by 
its offspring. 

At each generation, it is possible to let a user- 
defined fraction of the population undergo such a 
local search. We have found improved eiHdency 
of docking with local search frequencies of just 
0.06, although a frequency of 1.00 is not signifi- 
cantly more efficient. 26 Both the canonical and a 
slightly modified version of the Solis and Wets 
method have been implemented^ In canonical Solis 
and Wets, the same step size would be used for 
every gene, but we have improved the local search 
efficiency by allowing the step size to be differ- 
ent for each, type of gene: a change of 1 A (1 A. = 
10" 10 m) in a translation gene could be much 
more significant than a change of 1° in a rotational 
or torsional gene. In the docking experiments pre- 
sented here, the translation al step size was OJZ A, 
and the orientatiunal and torsional step sizes 
were 5°. 

La the Lamarckian genetic algorithm, genotypic 
mutation plays a somewhat different role than it 
does in traditional genetic algorithms. Tradition- 
ally, mutation plays the role of a local search 
operator, allowing small, refining moves that are 
not efficiently made by crossover and selection 
alone. With the explicit local search operator, how- 
ever, this role becomes unnecessary, and is needed 
only for its role in replacing alleles that might have 
disappeared through selection. In LGA, mutation 
can take on a more exploratory role. The Cauchy 
deviates are a compromise between radical jumps 
to arbitrary sections of the conf orma tion space and 
detailed exploration of the local topography. 

DERIVATION OF THE EMPIRICAL BINDING 
FREE ENERGY FUNCTION 

The study of molecular structure underpins 
much of computational molecular biology. There 
are several established methods for performing 
molecular mechanics and molecular dynamics, no- 
tably Amber/ 4 - 95 Chaemm, 36 Discover, 37 Ecepp, 3 * 
and Gromcs, 39 Many of these traditional force 
fields model the interaction energy of a molecular 
system with terms for dispersion/repulsion; 40 hy- 
drogen. bonding, 41 electrostatics, 42 ~* 5 and devia- 



tion from ideal bond lengths and bond angles. 
These methods are excellent for studying molecu- 
lar processes over rime, for optimizing bound con- 
formations, and for perfbxxning free energy per- 
turbation calculations between molecules with, a 
single atom change, 46 but they often require con- 
siderable investments of computer time and, un- 
fortunately, these approaches tend to perform less 
well in ranking the binding free energies of com- 
pounds that differ by more than a few atoms. 
What is needed is an empirical relationship 
between molecular structure and binding free 
energy. 

The first thoroughly established linear free en- 
ergy relationship was observed by Hamrnett as 
early as 1933, and reported in 1927 F It was used 
to relate structure and reactivity of small organic 
molecules on a quantitative basis. Harnmett was 
able to. derive substituervt constants and reaction 
constants that could then be used to calculate rate 
constants and equilibrium constants for a specific 
reaction of a specific compound. It could be said 
that Hammerf s work was the forerunner of mod- 
ern-day quantitative structure— activity relation-, 
ships (QSAR), pioneered by I-Tansch and cowork- 
ers in the 1960s. Here it is assumed that the sum of 
the steric, electronic, and hydrophobic effects of 
substituents in a compound determines its biologi- 
cal activity; see, for example, Fujita, 4 * Hansch, 49 
and more recently Selassie et al. 30 

Current structure-based scoring functions seek 
to remedy some of the deficiencies of txaditional 
force fields by developing empirical free energy 
functions that reproduce observed binding con- 
stants. Most of these approaches use an expanded 
'"master equation" to model the free energy of 
binding, adding entropic terms to the molecular 
mechanics equations 51 : 

AG = AG vdw + AGfct^a -f- AG alac -t* AGcarJorm 

+ AG tot + AG fiol 

where the first four terms are the typical molecular 
mechanics terms for dispersion/ repulsion, hydro- 
gen bonding, electrostatics, and deviations from 
covalent geometry, respectively; AG toT models the 
reslxiction of internal rotors and global rotation 
and translation; and AG wl models desolvation 
upon binding and the hydrophobic effect (solvent 
entropy changes at solute -solvent interfaces). This 
latter term is the most challenging. Most workers 
use variants of the method of Wesson and Eisen- 
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berg, 52 calculating a desolvation energy based on 
the surface area buried upon complex formation, 
■with the area, of each buried atom being weighted 
by an atomic solvation parameter. Bohrn. built on 
earlier work with the de novo inhibitor design 
program Ltoti, 53 and used linear regression to cali- 
brate a similar function against a set of 45 diverse 
proteirv-ligand. complexes with published binding 
constants,^ The final function predicted binding 
constants for a set of test complexes with a stan- 
dard deviation equivalent to about a factor of 25 in 
binding constant!' more than sufficient to rank in- 
hibitors with. miHimjular, micromolar, and nanomo- 
lar binding constants. Jain devised a continuous, 
differ entiable scoring function, 55 which is, in 
essence, very similar to that of Bohm, but based on 
non-physical pairwise potentials using Gausslarts 
and sigmoid al terms. 

We have implemented a similar approach using 
the thermodynamic cycle of Wesson and Eisen- 
b erg 52 The function includes five terms: 



AG - 



(A 



(J 



V 
'til 
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D 



r 12 



r 10 



+ AG hbcmd i:E(f) 

+ AG^JV^ 



(1) 



where- the five AG terms on the right-hand side 
are coefficients empirically determined using lin- 
ear regression analysis from a set of protein- ligand 
complexes with known binding constants, shown 
in Table I The summations are performed over all 
pairs of ligand atoms, i, and protein atoms, j, in 
addition to all pairs of atoms in the ligand that are 
separated by three or more bonds. 

The in vacuo contributions include three interac- 
tion energy terms, used in previous versions of 
AtjtoDoqc a Lennard-Jones 12-6 dispersion /re- 
pulsion term; a directional 12-10 hydrogen bond- 
ing term, where E(f ) is a directional weight based 
on the angle, f , between the probe and the target 
atom 9 ; and a screened Coulombic electrostatic po- 
tential. 56 Each of these terms, including their pa- 
rameterization, have already been described. 7 

A measure of the unfavorable entropy of ligand 
binding due to the restriction of cxiniormatioxial 
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degrees of freedom is added to the in vacuo func- 
tion. This term is proportional to the number of 
sp 3 bonds in the ligand, N^. 54 We investigated 
variants that included and excluded me thy L hy- 
droxy!, and amine rotors. 

In the development of an empirical free energy 
function for AuroDoac, the desolvation term was 
most chaUenging^ because AutoDoqc uses a grid- 
based method for energy evaluation, and most 
published solvation methods are based on surface 
area calculations. We investigated two different 
methods of calculating the derivation energy 
term. The first of these methods was based on 
estimating atom-by-atom contributions to the in- 
terracial molecular surface area between the ligand 
and the protein using the difference in the surface 
areas of the complex and the unbound protein and 
unbound ligand. Both the solvent-accessible and 
solvent-excluded surface areas were considered, 
being calculated with MSMS/ 7 a fast and reliable 
program that computes analytical molecular sur- 
faces. Unfortunately, there can be significant errors 
in the value of the interracial solvent-accessible 
surface areas, due to the '■'collar" of accessible- 
surface that surrounds the ligand— protein interface 
in the complex. We also tested seven variants of 
the -pairwise, volume-based method of Stouten 
et al. 58 : this method has the advantage that it is 
consistent with the pre-calculated affinity grid for- 
mulation used by ATrroDocK- For each atom in the 
ligand, ixagmental volumes of surrounding protein 
atoms are weighted by an exponential function 
and then summed, evaluating the percentage of 
volume around the ligand atom that is occupied 
by protein atoms. This percentage is then weighted 
by the atomic solvation parameter of the ligand 
atom to give the desolvation energy- The full 
method may be broken into four separate compo- 
nents: burial of apolar atoms in the ligand, burial 
of apolar protein atoms, burial of polar and charged 
atoms in the ligand, and burial of polar and 
charged protein atoms. Great success has also been 
reported in using simply the amount of hydropho- 
bic surface area buried upon cample xau'on as a 
measure of the ''hydrophobic effect/' 54 so we 
tested several formulations that included only the 
volume lost around ligand carbon atoms. The 
burial of polar atoms caused particular problems, 
as discussed in what follows. Apart from the vol- 
ume-based method, we tested a simpler formula- 
tion for the solvent transfer of polar atoms; that is, 
a constant term corresponding to the 'favorable free 
energy of interaction of a polar atom with solvent 
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TABLE S. - f 

Protein -LJgand Complexes Uasd to C&Ubr&to Empirical Free Energy Function, Along wfth Bmokhaven Protein 
Dsta Bank (PDS) Acceaakm Coded and Binding. 



Protein-ligand complex 


PDB code 




Concanavalin A / a-methy(-o-mannopyranoslda 


4cna 


2.00 


Garboxypepridase A / glycyi-L-tyrosine 


3cpa 


3-88 


Carboxypeptidase A / phosphonate 2AA= P = (O) F 


6cpa 


11,52 


Cytochrome P-450 ( _ Arn / camphor 


2cpp 


6.07 


Dihydrofolate reductase/ methotrexate 


a -if- 

4ctfr 


9.70 


a-Thrombln / benzamidine 


J J. i ,L 

1dwb 


2.92 


En doth ia pe psl n / H-256 


zer6 


7.22 


^-Thrombin / MQPA 


tetr 


7-40 


*Y*I i i ILIA a r*^ 

^-Thrombin / NAPAP 


lets 


8.52 


^-Thrombin / 4-TAPAP 


1 en 


6.19 


FK506-binding protein (FKBP) / immunosuppressant FK506 


1fkf 


9.70 


o-Oalactose / ocjlucose binding protein / galactose 


2gbp 


7.60 


Hemagglutinin /sialic acid 


4hmg 


2.55 


HfV-1 Protease /A78791 


1hvj 


10.46 


HlV-1 Protease /MVT1 01 


4hvp 


6.15 


HiV-1 Protease / acylpepstatine 


5nvp 


5,96 


HIV-1 Protease / XK263 


Ihvr 


9.51 


Fatry-acid-binding protein / C 15 COOH 


2lfb 


5.43 


Myoglobin (ferric) / Imidazole 


1 mbi 


. 1.88 


McPC803 / phosphochollne 


2mcp 


5.23 


^-Trypsin / benzamldlne 


3ptb 


4.74 


Retino (-binding protein J retinal 


Irbp 


6.72 


Thermolysin / Lou-hydraxylamine 


4tln 


3.72 


Thermolysin / phosphoramidon 


Itlp 


7.55 


Thermolysin /n-(l-carboxy-3-phenylpropyO-Leu-Trp 


Itmn 


7.30 


Thermolysin / Cbz-Phe^p-Leu^Ala (ZFpLA) 


4tmn 


10.19 


Thermolysin / Cbz-GIy-p-Leu-Leu (ZGpL.0 


5tmn 


8.04 


Purine nucleoside phosphoryiase (PNP)/ guanine 


luib 1 


5.30 


Xylose isomerase/CB37l7 


2xis 


5,82 


Triose phosphate isomerase tTlM) / 2-phosphogly colic acid (PQA) 


2ypi 


4.82 



* Adapted from BShm. 



Thirty proteia-ligand complexes with pub- 
lished binding constants were used ia the calibra- 
tion of AutoDock's fee energy function (Table T), 
and were chosen from the set of 45 used by Bohm, 54 
omitting all complexes that he modeled (i.e., using 
only complexes for which crys tall o graphic struc- 
tures -were available). One of the limitations of 
these "binding constant data is that the conditions 
under which they were determined vary, which, 
intrinsically limits the accuracy of our best model. 
We converted between the inhibition constant, K } , 
and the observed free energy change of binding, 
AG^j using the equation: 

where R is the gas cons bint, 1,987 cal K" 1 mol~\ 
and T is the absolute temperature, assumed to be 



is estimated, and this is subtracted from the bind- 
ing free energy- 

Trilinear interpolation is used to evaluate 
rapidly the intermolecular dispereinn/repulsion 
energy ; the hydrogen bonding energy, the electro- 
static potential, and the solvation energy of each 
atom in the ligand, using grid maps that have been 
pre-calculated over the protein for each atom type 
in the ligand. In AutoTX)CK 3.0, we have imple- 
mented a faster method of trilinear interpolation 59 
than was available in earlier versions of AuroDoQc 
Both methods are mathematically equivalent The 
original implementation used 24 multiplications to 
perform each three-dimensional trilinear interpola- 
tion, but, by cascading seven cd^e-dirnensional in- 
terpolations, the number of multiplications nas 
been reduced to 7. 
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room temperature, 298.15 K 60 Note that this equa- 
tion lacks a minus sign because the inhibition 
constant 13 defined for the dissociation reaction, 
EI ^ E + I, 61 whereas AG^ refers to the opposite 
process of binding, E -f I ^ EI; 1 where E is the 
enzyme and I is the inhibitor- 

To remove any sfceric clashes in the crystallo 
gtaphic complexes,, each ligand was optimized us- 
ing AutoDock's new So lis and Wets local rnini- 
xnlzatiori technique described earlier, but with, 
tihe previously reported, force field. 7 The separate 
contributions from the hydrogen bonding, dispex- 
sion/repulnLorv electrostatic and solvation ener- 
gies were evaluated. Empirical free energy coeffi- 
cients for each of these terms were derived using 
linear regression in the S-plus software packaged 
and cross-validation studies were performed. In 
total, 900 different binding free energy models 
were tested; each linear model consisted of a van 
der Waals term, a hydrogen bonding term (one of 
6 variants), an electrostatic term, a torsional en- 
tropy term (one of 5 variants), and a desolvation 
term (one of 15 variants). We also investigated 
whether the inclusion of a constant term improved 
the modeL Six of the seven test systems used to 
test the docking procedure, which were originally 
used to test AuTODOCK, version 2.4/ were also in 
the training set of 30 protein— li garni complexes; 
therefore, to validate the chosen coefficients, linear 
regression was repeated for the set of 24 
protein.— ligand complexes, excluding the 6 over- 
lapping test systems. 

TESTING DOCKING METHODS ' 

Seven protein-ligaxid complexes, with a range 
of complexity and chemical properties were cht> 
sen from the Brookhaven Protein Data Bank 63 ' 64 to 
compare the performance of the docking tech- 
niques (see Fig. 2X To facilitate comparison with 
the previous force field, we chose the same set of 
six test systems investigated earlier, 7 but added a 
harder docking problem to challenge all true search, 
methods (see Table E). The simplest test cases 
were the ^trypsin/benzarMdine and cytochrome 
/camphor complexes, which had smalL 
rigid ligands. Interactions in the former are domi- 
nated by electrostatic mter actions and hydrogen 
bonds to the substrate amidine, whereas the latter 
is dominated by hydrophobic interactions- McPC- 
603 /phosphocholine and strepUvidin/biotLa were 
moderately flexible, and represented test systems 
having an intermediate level of difficulty. HTV-1 
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prortease/XK263 7 hemagglutirun/sialic add, and 
dihydrofblate reductase /methotrexate provided 
more difficult tests, with many rota table bonds 
and diverse chemical characteristics. . 

We compared the performance of Monte Carlo 
simulated annealing (SA), the genetic algorithm 
(GA), and the Lamar ckian genetic algorithm (LGA). 
The new empirical free energy function presented 
here was used for energy evaluation in all cases. 
Dockings were performed using approximately the 
same number of energy evaluations (~ 1.5 mil- 
lion)/ so each method could be judged given simi- 
lar comp utational investments- The CPU time 
taken for a single docking varied ftom 4.5 to 41.3 
minutes, on a 200-MHz Silicon Graphics MIPS 
4400 with 12S MB of RAM, depending on the 
number of rotatable bonds and the number of 
atoms in the ligand. 

At the end of a set of dockings, the docked 
conformations were exhaustively compared to one 
another to determine similarities, and were clus- 
tered accordingly. The user-defined root-mean- 
square positional deviation (rmsd) tolerance was 
used to determine if two docked c onf ormatLons 
were similar enough to be included in the same 
cluster, and symmetrically related atoms in the 
ligand were considered. These clusters were ranked 
in order of increasing energy, by the lowest energy 
in each cluster. Ordinarily, the structure of the 
protein-ligand complex would not be known, so 
the criteria by which the dockings would be evalu- 
ated are the energies of the docked structures, and, 
in cases where there are several plausible, low-en- 
ergy structures, the number of c onf orma ti ons in a 
c onf ormati onally similar cluster. Because one of 
our goals was to test the ability of the methods to 
reproduce known structures, we also compared 
the rmsd between the lowest energy docked struc- 
ture and the crys tall o graphic structure. 

DOCKING- AND SEARCH-METHOD-SPBCIFIC 
PARAMETERS 

The proteins and ligands in the seven docking 
tests were treated using the united-atom approxi- 
mation, and prepared using the molecular model- 
ing program, Syhyl 65 Only polar hydrogens were 
added to the protein, and Kollman united-atom, 
partial charges were assigned. Unless stated other- 
wise, all waters were removed- Atomic solvation 
parameters and fragmental volumes were assigned 
to the protein atoms using a new AtrroDcxx util- 
ity, AddSol. The grid maps were calculated using 
AtjtoGrld, version 3.0. In all seven protein-ligand 
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(d) (e) 




(f) (g) 




FIGURE 2. The seven ligands chosen for docking, showing the rptatable bonds as curly arrows: (a) benzamidine; (b) 
camphor; (cD phosphocholine; (d) biotin; (e) H1V-1 protease inhibitor XK-263; (f) isopropylated sialic acid; and (g) 
methotrexate. Note that two iigands, (e) and (f), contain hydroxy! rotors, which are not counted in the total number of 
torsional degrees of freedom; note also that cyclic rotatable bonds are excluded. 



cases, we used grid maps with 61 X 61 X 61 points, 
a grid-point spacing "of 0.375 A, and, "because the 
location of the Hgand in the complex "was known, 
the maps were centered on the ligand's "binding 
site. The ligands were treated in SybyX initially as 
all atom entities, that is, all hydrogens were added, 
then partial atomic charges were calculated using 
the Gasteiger-Marsili method, 66 ' 67 AtjtoToks, an 
AuroDoQC utility / was used to define the rota table 
bonds in the ligand, if any, and also to unite the 



nnnpolar hydrogens added by Sybyl fox the partial 
atomic charge calculation. The partial charges on 
the nonpolar hydrogens were added to that of the 
hydrogen-bearing carbon also in AuroToRs. 

In all three search methods, 10 dockings were 
performed; in the analysis of the docked conforma- 
tions, the clustering tolerance for the root-mean- 
square positional deviation was 0.5 A, and the 
cry stalLo graphic coordinates of the Hgand were 
used as the reference structure. For all three search 



1643 



VOL 19, NO, 14 




AUTOMATED DOCKING 

TABLE IL 



Protein-ligand complex 



^-Trypsin / benzamicline 
Cytochrome P-^50 cam / camphor 
McPC-603 / Phosphochollne 
Streptavidin/biotin 
HIV-1 protease /XK2G3 

Influenza hemagglutinin /3ialic acid 
Dihydrofolate reductase / methoirexate 4dfr 



PDB 


Resolution 




code 


eft 


Reference 


3ptb 


1,7 


69 


2cpp 


1,63 


70 


2mcp 


3.1 


71 


1stp 


2.6 


73 


1hvr 


1.8 


75 


4hmg 


3.0 


76 


4dfr 


1.7 


79 



Number 
of 

rotatable 
bonds 



N 



lor 



Total 
number of 
degrees of 

freedom 



Energy 
of crystal 
structure 
(kca! mo/" 1 ) 



0 
0 
4 
5 

10 

11 
7 



0 
0 
4 
5 
8 
7 
7 



7 
7 
11 
12 
17 
18 
14 



-7.86 
-4.71 

+s.4a b 

-8.86 
-18.62 
-4.71 
-13.64 



B ~ _ _ ____ ' I 0.<^+ . 



methods, the step sizes were 02 A for translations 
and 5° for orientations and torsions. These step 
sizes determined the amount by which a state 
variable could change when a move is made in 
simulated annealing and the relative size of muta- 
tion in the local search, whereas the a and £ 
parameters determined the size of the mutation in 
the genetic algorithms, GA and LGA. The Cauchy 
distribution parameters were a = 0 and j9 = L 
Note that in simulated annealing, random changes 
were generated by a uniformly distributed random 
number generator; in the Solis and Wets local 
search, by a normal distribution; and, in the ge- 
netic algorithm, by a Cauchy distribution. In the 
simulated annealing tests, the initial slate of the 
ligand was chosen randomly by AtxroDooc We 
used the optimal set of simulated annealing pa- 
rameters that were determined from the schedule 
experiments described, earlier/ These included an 
initial annealing temperature of 616 cal mol" 1 ^ a 
linear temperature reduction schedule, 10 runs, 50 
cycles, and a cyde- termination criterion of a maxi- 
mum of 25X100* accepted steps or 25,000 rejected 
steps,, whichever came first. The rrdnimum energy 
state was used to begin the next cycle; the only 
exception was for lhvr, where the initial annealing 
temperature was increased to 61,600 cal mol" 1 . 
The maximum initial energy allowed was 0.0 kcal 
mol" 1 , and the maximum number of retries was 
1000, used to generate a low energy random initial 
state to begin each simulated annealing docking. 



In the GA and LGA dockings, we used an initial 
population of random individuals with a popula- 
tion size of 50 individuals; a maximum number of 
1-5 X 10 6 energy evaluations; a maximum number 
of generations of 27,000; an elitism value of 1, 
which was the number of top individuals that 
automatically survived into the next generation; a 
mutation rate of 0,02, which was the probability 
that a gene would undergo a random change; and 
a crossover rate of 0,80, which was the probability 
that two individuals would undergo crossover. 
Proportional selection was used, where the aver- 
age of the worst energy was calculated over a 
window of the previous 10 generations. In the 
LGA dockings, the pseudo-Solis and Wets local 
search method was used, having a maximum of 
300 iterations per local search; the probability of 
performing local search on an individual in the 
population was 0,06; the maximum number of 
consecutive successes or failures before doubling 
or halving the local search, step size, p, was 4, in 
both cases; and the lower bound on p, the termi- 
nation criterion for the local search, was 0.01. 



Results and Discussion 

CALIBRATION OE EMPIRICAL EREE 
ENERGY FUNCTION 

Several linear regression models were tested for 
their ability to reproduce the observed binding 
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TABLE III. m 

Calibration of Empirical Free Energy Function. 



Residual 
standard 
Model* error 



A 2.324 
B 2.232 
C 2.177 



Multiple 












0.3488 
0.9537 
0.9559 


0.1795 
(0.0263) 

0.1518 
(0.0269) 

0.1435 
(0.0237) 


0.1133 
(0.0324) 

0.1186 
(0.0246) 

0.1 14$ 
(0.0238) 


0.01 66 
(0.0825) 

0.0126 
(0.0382) 

0.0656 
(0,0558) 


0-3100 
(0.0873) 
' 0.3546 
(0-089O) 

0.3113 
(0.0910) 


0.0101 
(0.0585) 

0.1539 
(0.1050) 

0.1711 
(0.1035) 



, _^ . , .„ ' „ , , — u ~ "; ull V" wiuiiiu ibiiii. moaei a: tuii voiume-Dased solvation term 

and standard 10-1 2 hydrogen bonding, as In Eq. (1). Model B; apolar ligand atoms only In the solvation term, and standard 1 0-1 2 
hydrogen bonding. Model C: apolar ligand atoms only in the eoNation am, and the standard 10-12 hydrogen less the estimate 
average, &e JrvEq. (2). 

VaJues for the model coefficients, with standard deviations In paremhes&s. 



constants of structurally characterized complexes. 
Table HI shows the results for the three major 
candidates, and Figure 3 shows the correlation 
between, the observed and the predicted binding 
free energies for the 30 protein-ligand complexes 



in the calibration set, using the chosen model 
(model C). Model A adds the full volume-based 
solvation method and the torsional restriction term, 
to the original molecular mechanics force field. 
Model B simplifies the solvation method by evalu- 




Obscrved "binding free energy (kcsl/mol) 



FIGURE 3- Predicted versus observed binding frse energies for the calibration set and the docking testa. The solid 
line shows a perfect fit, and the dotted lines show one standard deviation above and below this. Hollow diamonds show 
the 30. protein-ligand complexes used In fitting trie terms of the binding free energy. function. Solid .triangles show the 
results of the simulated annealing (SA) dockings, solid diamonds show the genetic algorithm (OA) dockings, and the 
solid squares show the Lamarckian genetic algorithm (LGA) dockings. Note the outlying biotin-streptavidin complex 
(Istp), where it is believed there are significant crjntriburlons'to the binding free energy due to protein rearrangements. 
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ml B« ligand. Model C also uses only ligand car- 
W atoms in the desolation cdculatL^ also 
rWH™"^,* teim f ° h ^«>gen bonding 

Dcor q n ^ f ° r ^^ration into Aim> 

^ 011 ite better overall statistics, and 
- mteria discussed in what follows. The form of 
™s free energy function is: ' 



AG 



r - 6 - , 



/ 



G 



r 12 

V ij 



Glee <L-i r \ 



D 



.10 



+ E 



Kb and 



(2) 



where E hboni is the estimated average enerev of 
hydrogen bonding of water with a polar atomfand 
the summation in the solvation term is perf ormed- 
ova all pairs consisting of only carbon atoms in 
the hgand, /, and atoms of all types / in the 
protein. Note that the internal or intramolecular 
afteraction energy of the ligand is not included in 
. the calculation of binding free energy; during 

tllS^iTT*' toterrifll * deluded in 

the total docked energy, because changes in ligand 

conformation can affect Hie outcome of the dock- 
ing, so this must be taken into consideration. We 
looked at linear regression models that did include 
the internal energy, and found that adding this 
term did not improve the model. The assumption 
made is that'fhe internal energy of the ligand in 
solution and in the complex are the same The 
energies used and reported by AUTODoac should 
be distinguished: there are docked energies, which 
include the intermolecular and intramolecular in- 
teraction energies, and which are used during 
dockings; and predicted free energies, which include 
the jntennolecular energy and the torsional free 
energy, and are only reported at the end of a 
docking. Because the intermolecular energy grid 
maps include the derivation term, dockings using 
the new, empirical force field in AutoDock ver- 
sion 3.0 may be qualitatively different from results 
found using earlier versions. 

Three coefficients, for dispersion/repulsion, 
electrostatics, and loss of torsional freedom were 



AUTOMATED DOCKING 
very stable in the linear reeressirm aT . = i ■ • , 

tons and reasonable standard deviate , 5™ 
W model, disp.rsion/repuio^Se^ 
parameters taken from AmbL« wexeweT^Iw 

E E£ ,° £ °if?' ***** ^^y T£lt 

Of teal mol-i for the most favorable atom-atnm 
contacts Electrostatics, modeled wi£ a s^reeS 

0.1146, yielding an energy of about -10 kcal 
mol for an ideal salt bridge. In the farsicS 
restriction term, each torsional L degree^f 
requires 0.3113 kcal mor\ freedom 

wi^S^?* dlfferenCeS brtween - ^odeh occurred 
T 1 ^ 011 ° f * e h ^°g^ bonding term 
i ^ deS ° 1VatiQn telm ' Hydrogen bonding^ 
. modeled with a directional 12-10 poten^^We 
encountered a major problem when calibrating this 
hydrogen bonding function. Because the test «t 

-^--ligand compile 
opfami2ed by millions of years of evolution, hydro- 
gen bonding groups in the ligands are rJarly al- 
ways paired with the appropriate hydrogen borl 
mg group m the protein. Thus, the number of 
hydrogen bonds that the ligand forms in the com-" 
plex and the number it forms with solvent when 
itee . m solution axe approximately the same; that 
is there is Lttie change in the free energy 0 f 
hydrogen bonding, and AG hbond was evaluated to 
be approximately zero. Unfortunately, this pro- 
vides no information on the cost of burying a 
hydrogen bonding group without forming a bLd 
with the protein, and our data set did not include 
cases to evaluate this. Of course, the volume-based 
solvation method should account for this— the un- 
favorable polar contribution to the solvation en- 
ergy should compensate for the favorable 12-10 
hydrogen bonding energy. The linear regression, 
however, consistently returned coefficients that set 
the hydrogen bonding energy and derivation en- 
ergy to nearly zero, and increased the dispersion/ 
repulsion term to compensate (see Model A in 
Table HD. We chose an alternative formulation to 
resolve this problem. 

We obtained the best results by separating the 
derivation of polar atoms from the volume-based 
calculation. We assumed that the extent of hydro- 
gen bonding in the complexes was roughly the 
same as the extent of hydrogen bonding in solu- 
tion. The calculated hydrogen bonding energy us- 
ing the directional 12-10 hydrogen bonding func- 
tion, was divided by the maximal number of 
possible hydrogen bonds, counting' two for each 
oxygen atom and one for each polar hydrogen. For 
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the 30 complexes used in calibration^ it was found 
that 36% of the maximum possible hydrogen 
bonding sites were actually utilized. Values of 
E^hand ranging from 36% to 100% of the maximal 
well depth of 5 kcal mol^ 1 had little effect on the 
success of the formulation (data not shown), and a 
value of 36% was chosen Optimized weights 
yielded an ideal hydrogen bonding energy in the 
complex of —0328 kcal mol" 1 , and the estimated, 
average energy of each hydrogen bond in solution 
of — 0.118- Because hydrogen bonding was mod- 
eled by this difference, the typical hydrogen bond- 
ing free energy of a complex was approximately 
zero, but there was a penalty of about 02 kcal 
mol" 1 for oxygen and nitrogen atoms that did not 
form hydrogen bonds, driving the simulation to- 
ward docked conformations with maximal hydro- 
gen bonding. We are currently exploring an appro- 
priate data set for evaluating this formulation more 
rigorously. 

The desolvation for carbon atoms in the ligand 
was evaluated using two different classes of atom 
type, aliphatic and aromatic, as in the original 
study. 58 The desolvation term was weighted by a 
factor of 0.1711 in our final empirical free energy 
force field, so a typical aliphatic carbon atom yields 
an energy of about — 0*2 kcal mol" 1 upon binding. 

We cross-validated the free energy model in 
two ways. First, we investigated the influence of 
each member of the training set on the final coeffi- 
cients of the model, by removing each one from 
the training set and calculating the coefficients 
from the remaining 29 complexes. We found that 
none of them had a strong effect on the final 
values of the coefficients. 

We also performed a second kind of cross-vali- 
dation of the free energy model, by performing 
Solis and Wets local search using AtjtoDock and 
the Tt0W free energy function, starting from the 
x-ray crystallogxaphic conformations of each" in- 
hibitor in 20 HTV-1 protease-inrubitor complexes, 
to compare the resulting optimized conformations' 
predicted free energy change of binding, AG Undins , 
with the experimentally determined values. These 
protease inhibitors were quite different having 
from 7 to 2JB torsions, and widely different side 
chains — : charged, polar, and hydrophobic, and 
constituted a diverse test set As can be seen from 
the results in Table DC, the correlation was very 
good, with an overall rmsd between the experi- 
mental and calculated values of AG VndiAg of L92 
kcal inol" 1 . 

The final form of the free energy function may 
seem overparameterized, with additional weight- 



ing parameters added to a previously optimized 
parameterLzation. We retained the molecular me- 
chanics formulation, however, specifically for its 
ability to model the distance dependence of each 
energetic term. This distance dependence (and an- 
gular dependence in hydrogen banding) is essen- 
tial for finding valid docked conformations, but 
the amount and resolution of the available pro- 
tem~liganxt data do not support a full re-parneberi- 
zatLon of the functions , 

DOCKING EXPERIMENTS 

Because we are comparing different search 
methods, it is important to ensure that the meth- 
ods are treated equally , It is therefore important 
that each search method be allowed approximately 
the same number of energy evaluations in a dock- 
ing. The number of energy evaluations in a dock- 
ing depends on the termination criteria, and be- 
cause it is not possible to predict how many 
accepted or rejected steps the stochastic SA method 
will make at a given temperature, the number of 
evaluations varies in SA. The range was from 
1.19 X 10 6 to 2.33 x 10 6 , depending on the pro^ 
tein-ligand test system, even though the same 
parameters were used for the number of cycles, 
accepted steps and rejected steps. In the case of the 
GA dockings, the population was 50 and the num- 
ber of generations was 27,000, which gave, a total 
of 1.35 X 10 6 energy evaluations in a docking; . 
thus, the GA dockings were terminated by reach- 
ing the maximum number of generations. In the 
case of the LGA deckings, 6% of the population 
underwent Larnarckian local search, each search 
consisting of 300 iterations and each iteration us- 
ing an extra energy evaluation. Thus, the LGA 
dockings, even with the same population size and 
number of generations as the GA, were terminated 
by reaching. the maximum number of energy eval- 
uations, 1.50 x 10 6 . 

The results of the simulated annealing/ genetic 
algorithm, and the Larnarckian genetic algorithm 
docking experiments are summarized in Tables IV, 
V and VI, respectively. The lowest energy docked 
structure found by each method is compared with 
the crystal structure of the ligand in Figure 4. The 
predicted change in free energy upon binding, 
AGp^ed' lowest energy found by LGA is 

shown in Table VII, along with the experimentally 
observed change in free energy upon binding, 
AG^. In addition, the breakdown of the energy of 
the lowest energy docked conformation is shown, 
in terms of the mterrnolecular interaction energy, 
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TABLE TV, 

Reault© of Simulated Annealing Dockings* 1 



Energy (kcal mo!" 1 ) and rmsd (A) 



AUTOMATED DOCKING 



PDB 

code 


Number 
of 

clusters 


Number 

In 
rank 1 


Lowest 
energy 


rmsd 

of 
lowest 
energy 


Mean 
energy 




Mean 
rmsd 


3ptb 


5 


6 


-8.03 


0.21 


-7.84 


(0.08) 


0.50 


2cpp 


4 


e 


-7,29 


0.81 


-7.22 


(0.03) 


0.91 


2mcp 


10 




-4.09 


0-88 


70.89 


(2.10 x 10 a ) 


5.40 


1stp 


10 




-8.43 


1.27 


-7,71 


(0.66) 


1.24 


Ihvr 


10 




-11.77 


1.15 


1,12 X 10 5 


(3.36 X 10 s ) 


6.13 


4hmg 


10 




-2.59 


3,77 


6.99 X 10* 


(1.52 X 10 s ) 


6.20 


4dfr 


10 




-8.73 


4-83 


6.13 X 10* 


(1.96 X 10 3 ) 


5.04 



(0-17) 
(0,30) 
(4. BO) 
(0.35) 
(2.61) 
(2.94) 
(1 .74) 



Number of 

energy 
evaluations 



2.01 x 10 s 
2,33 x 10 s 
1.85 x 10 s 
2.00 x 10 a 
1.19 X 10 Q 
1.55 X 10 a 
1,30 x 10 s 



The parameters used were 10 runs, 50 cycles, and a cycle-termination criterion of 25,000 accepted 
stapSy whichever came first The rmsd conformational clustering tolerance was 0,5 A, calculated from the 
coordinates. Standard deviations given jn parentheses. 



steps or 25,000 rejected 
igand's crystaJlographic 



AG ini*r *e intramolecular energy, AG^, and the 
torsional free energy, AG WC . These results are dis- 
cussed case-by-case in what follows; "erystallo- 
graphic rmsd 7 '' refers to the r oo t-mean-s qu are posi- 
tional deviation of a given conformation from the 
crystallographic coordinates, 

p Trypsin / Benzarnidinc Optft) 

The recognition of benzamidine by /3- trypsin, 
which binds tightly in the specificity pocket of 
trypsin/ is chiefly due to the polar amidine moiety 



and the hydrophobic benzyl ring. 69 The axnidine 
moiety was treated as being protonated. It was 
assumed that delocalLzarion of the Tr-electrons of 
the benzene ring extended to the ^system of the 
axnidine, and thus the ligand was treated as a rigid 
body. All three search methods succeeded in find- 
ing lowest energy conformations that were also the 
ones with the lowest crys typographic rmsd. In this 
case, the method that found the docked structure 
with the lowest energy was GA (Table V), but that 
found by the LGA method (Table VI) was practi- 
cally the same. The mean of the final docked 



TABLE V. 

Results of Genetic Algorithm Dockings. 8 



Energy (kcal mol ~ "") and rmsd (A) 



PDB 
coda 


Number 
of 

dusters 


Number 

in 
rank 1 


Lowest 
energy 


rmsd 

of 
lowest 
energy 


Mean 
energy 




Mean 
rmsd 




Number of 

energy- 
evaluations 


3ptb 


2 


9 


-8.17 


0-32 


-7.72 


(1 .35) 


1.5Q 


(3.39) 


1.35 x 10 fi 


2cpp 


4 . 


7 


-7.36 


0.33 


-6.65 


(2.11) 


2.1 B 


(3.42) 


1.35 x 10 B 


2mcp 


10 


1 


-5.17 


0.85 


-3.61 


(0.95) 


5.28 


(2.98) 


1.35 x 10 e 


1stp 


7 


4 


-10.09 


0-75 


-8,42 


(1.32) 


2.98 


(3.04) 


1-35 x 10 a 


1hvr 


7 


4 


-21.41 


0.82 


-11,09 


(9.79) 


2.79 


(1.97) 


1.35 X 10 Q 


4hmg 


9 


2 


-7.60 


1.11 


-5.72 


(1.77) 


2.32 


(1.43) 


1.35 X 10 e 


4dfr 


10 


1 


-16.10 


0.95 


-10J24 


(3.95) • 


4.39 


(2.37) 


1.35 X 10 6 



* The parameters used were 1 0 runs a population size of 50, and a ruiv termination criterion of a maximum of 27.000 generations or 
a maximum of 1.5 x 10 B energy evaluations, whichever came first Note mat, In this case all runs terminated after the maximum 
number of generations was reached, which equals the product of the population size and the number of generations. The rmsd 
conformational clustering tolerance was 0.5 A, calculated from the llgand's crystallographic coordinates. Standard deviations are 
given in parentheses. 
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TABLE VI. 

Results of Uamarcklan Genetic Algorithm Dockings.' 



Energy (kcal mol -1 ) and rmsd (A) 



PDB 
code 


Number 
of 

clusters 


Number 

fn 
rank 1 


Low rat 
energy 


rmsd 

Of 
lowest 
energy 


Mean 
energy 




Mean 
rmsd 




Number of 

energy 
evaluations 


3ptb 


.1 


10 


-8-15 


0.45 


-8,15 


(0.00) 


0,46 


(0.01). 


1.50 x 10 6 


2cpp 


1 


10 


' -7.36 


0.93 


-7.36 


(0.00) 


0.93 


(0.00) 


1.50 x 10 6 


2rncp 


6 


2 


-5.54 


1.05 


-4.15 


(0.15) 


1.10 


(0.07) 


1.50 X 10 a 


Istp 


1 


10 


-10.14 


0.69 


-10,0£ 


(0.05) 


0.66 


(0.06) 


1.50 X 10 s 


1hvr 


2 

4 


9 


- 21 .38 


0.76 


-19.11 


(6.92) 


0.B5 


(0.35) 


1.50 X 10 s 


4hmg 


3 


7 


-7.72 


1.14 


-7.54- 


(0.19) 


1.1B 


(0.12) 


1.50 x 10 s 


4dfr 


a 


7 


-16.96 


1.03 


-16.90 


(0.07) 


0.96 


(0.07) 


1.56 X 10 6 



B The parameters used were 10 rune, a population size of 50, and a run-termination criterion of a mawmum of 27,000 gene rations 
or a maximum of 1.5 x 10 s energy evaluations, whichever came first. Because local search also uses energy evaluations, the total 
number of energy evaluations for the LGA method was greater than that for the GA method, using the same papulation size and 
maximum number of generations; In the LGA dockings, the runs terminated because the maximum number of energy evaluations 
was exceeded. The rmsd conformational clustering tolerance was 0.5 A, calculated from the ligand's crystaJlographic coordinates, 
Standard deviations are given in parentheses. 



energy across the ten dockings was lowest for 
LGA, followed by SA, and finally GA- This is 
reflected in a comparison of the mean rmsd of the 
docked conformation from the crys tallo graphic 
structure for each of the methods: GA had the 
highest mean rmsd, followed by SA, and on aver- 
age, the LGA produced conformations with the 
lowest crys tal lographic rmsd. Thus, considering 
their average performance, the best search method 
at finding the lowest energy and the lowest rmsd 
was the LGA. The predicted binding free energy, 
AG^.^, of the lowest docked energy structure ob- 
tained using the LGA method was —8,15 kcal 
mol" 1 (Table VII}, whereas the observed value, 
AGqu was —6.46 kcal mol" 1 ; this is within the 
estimated error of the model. 

Cytochrome P-450 CJUB /CampJior (2cpp) 

Camphor binds to the rnonooxygenase cy- 
tochrome P-450 cftm such that the 5~exo C — H band 
is hydroxylated stereospecifically- The active site is 
deeply sequestered within the enzyme, and the 
crystal structure of the complex does not possess 
an obvious substrate access channel. 70 This buried 
active site presents a more challenging docking 
problem than 3ptb. Once bound, however, the 
substrate is ''tethered'' by a hydrogen bond that is 
donated from the Tyr-96 hydroxyl to the caxborryl 
oxygen of camphor, while the subtle complemen- 
tarity of the pocket and the hydrophobic skeleton 
of camphor help to position the rest of the sub- 



strate. The lowest energy found was —736 kcal 
mol" 1 , found by both the GA (Table V) and LGA 
methods (Table VI); SA's lowest energy was — 7,29 
kcal mol" 1 (Table IV), which is practically the 
same. All methods found the crystallographic 
structure/ SA succeeding in 9 of 10 dockings, GA 
in 7 out of 10 dockings, and LGA in all of the 
dockings (with success, once again, being mea- 
sured as having a crys tall o graphic rxnsd of less 
than 1 A). In all three search method cases, the 
lowest energy, cluster was the most populated, 
with 6, 9, and 10 members using SA, GA, and 
LGA, respectively, The predicted binding free en- 
ergy r ■ AG pred , of the lowest docked energy struc- 
ture, was — 7.36 kcal mol*" 1 using the LGA rnethod 
(see Table VID, whereas the observed value, AG oW 
was —8-27 kcal mol -1 — once again/ this was 
within the estimated error of the model. 

McPC-603 / Phosplio choline (2mcp) 

Antibody molecules bind their target antigens 
with exquisite specificity/ having close comple- 
mentarity between antigen and antibody surfaces, 
hydrogen bonding, van der Waals, and electro- 
static interactions. ImjosphcK^holine binds to Fab 
McPC-603, 71 ' and is an example of recognition, is 
predominantly electrostatic in cbaracter, primarily 
due to the influence of Arg H52. 72 There is little 
corrformational change in -the side chains of Fab 
McPC-603 upon binding, as indicated from the 
unbound crystal structure. We allowed all four 



1_-V 1 ....Lit.! J 



AUTOMATED DOCKING 








(0 





FIGURE 4, A comparison of the lowest energy structure found by each search method and the crystal structure. The 
latter is shown in black. The simulated annealing results are rendered wrth a striped texture, the genetic algorithm 
results are shaded gray, and the Lamarckian genetic algorithm results are white. Oxygen atoms are shown as spheres; 
other hetaroatoms are not shown. Note that simulated annealing tailed in the last two test cases, 4hmg and 4dfr, but 
both the genetic algorithm and the Lamarckian genetic algorithm succeeded* 



bonds to rotate during docking. The energy of the 
crystal structure was positive, due in most part to 
a large positive internal energy dominated by C2 
and 01 being too close (2.26 A); this could be 
improved if local minuxtization had been per- 
formed on the crystal structure before docking. 
The lowest energy found by each of the three 
search methods were -4.09, — 5.17, and —534 
kcal moP 1 , using SA, GA, and LGA, respectively. 
Unlike 3ptb and 2cpp, these differences in energy 



were more significant. "Both SA and GA found 10 
different clusters, whereas LGA found 6 clusters. 
The mean energy of the 10 dockings was +70.89/ 
—3.61, and —4.15 kcal mol"" 1 for SA, GA, and 
LGA, respectively. Thus, on average, the LGA per- 
formed best in finding the lowest energy docked 
structure. Furthermore, the mean rmsd from the 
crystallo graphic coordinates was 5.40, 5.26, and 
1.10 A for SA, GA, and LGA, respectively, indicat- 
ing that LGA also reproduced the crystal structure 
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TABLE VII. _ ' 

Comparison of Predicted Free Energy of Binding, A G prwl , of Lowest Energy Docked Structure Obtained 
Using Lamarckian Genetic Algorithm, and Observed Fros Energy of Binding, AG o6l ," 



Energy (kcal mol~ 1 ) and rrnsd lA). 



PDB 

code 


Lowest 
energy 


rmsd of 

lowest 

energy 




AG !ntra 


A< 3(or 






(AG pred - AG obj ) 


3ptb 


-8.15 


0.45 


-8.15 


0.00 


0.00 


-8.15 


-6.46 


-1.69 


2cpp 


-7.30 


0.93 


-7,36 


0.00 


0.00 


-7.36 


-8.27 


+ 0.91 


2mcp 


-5.54 


1.05 


-6.57 


+ 1,03 


+ 1.25 


-5.32 


-7.13 


+ 1.51 


1stp 


-10,14 


0.69 


-9.30 


-0.24 


+ 1.56 


-8.34 


-18.27 


+ 9.93 b 


1hvr . 


-21,38 


0.76 


-19-34 


-2.04 


+2.48 


-16.85 


-12.96 


-3.89 


4hrng 


-7,72 


1.14 


-8.93 


+ 1.21 


+2.18 


-6.75 


-3.48 


-3.27 


4dfr 


-16.97 


1.03 


- 1 6.57 


-0.40 


+2.18 


- 14.39 


-13.22 


-1.17 



"AG, nlft/ Is thps Intermokxrular interaction energy between the ligand and the receptor, AG, nUfl Is the Intramolecular interaction 
energy of the llgand, and AG^ 0 , Is ths torsional free energy chenge of the ilgand upon binding. 

6 This largo discrepancy may be due to neglect of Hie conformational rearrangements of streptavidfn upon binding b latin, which are 
neglected In the docWng simulation and binding free energy calculation, 



most often. The predicted binding free energy, 
AGp ied , of the lowest docked energy structure wag 
—5.32 kcal mol" 1 using the LGA method (see 
Table VH), whereas the observed value, AG^, was 
— 7.13 kcal mol" 1 — this was also within the esti- 
mated error of the model. 

StreptavidJn / BloWn (lstp) 

One of the most tightly binding noncovalent 
complexes is that of streptavidin/biotin, with an 
experimentally observed dissociation constant, K d/ 
of 1Q~ 1S .M. Comparison of the apo form and the 
complex 73 shows that the high affinity results from 
several factors,, including formation of multiple 
hydrogen bonds and van der Waals interactions 
between the bio tin and the protein, in addition to 
the ordering of surface polypeptide loops of strep- 
tavidin upon binding biorrru The method that 
found the lowest energy was LGA/ at —10.14 kcal 
mnl" 1 , although GA was not significantly differ- 
ent, followed by SA with —8.48 kcal mol" 1 . Hie 
method with the lowest mean energy was LGA at 
-10.06 kcal mor\ then GA with -8.42 kcal 
mor 1 , and finally SA with -7.76 kcal mor 1 . The 
method that found the crystallo graphic complex 
coordinates most often was LGA/ having a mean 
rmsd of 0.66 A, then SA at 124 A, and finally g!a 
with 2.96 A. At the rmsd tolerance chosen for these 
experiments, 0.5 A, SA found 10 different confor- 
mational clusters, GA found 7 clusters (the most 
populated was rank 1, with 4 members), and LGA 
found 1 cluster. 1 



It was not possible to include the entropic ef- 
fects of the flexible surface loops of strep tavidin in 
the docking of biotin, although they make signifi- 
cant contributions to the binding free energy as- 
revealed by a recent set of experiments involving 
an atomic force microscope. 74 It was found that the 
unbinding forces of discrete complexes of strep ta- 
vidin with bio tin analogs were proportional to the 
enthalpy change of the complex formation but 
independent of changes in the free energy, which 
indicates that the unbinding process is adiabatic 
and that entropic changes occur after unbinding- 
This may help to explain why the predicted bind- 
ing free energy of the s trep tavidin /bio tin complex 
.(AG^) —10.14 kcal mol~\ underestimated the 
magnitude of the observed value (AG D ^ 5 ) —18.27 
kcal mol- 1 (Table VH). 

HIV-1 Protease /XK263 (Hot) 

HTV-1 protease inhibitors prevent the matura- 
tion of virions of HIV, and are a major target for 
computer-assisted drug design in the development 
of AIDS therapies. Substrates and inhibitors of 
HIV-1 protease are typically extended peptides or 
peptidornimetics, with a dozen or more freely ro- 
tatable bonds and, as such, they present a chal- 
lenging target for automated docking techniques. 
In addition, considerable protein motion is ex- 
pected in the flaps upon binding, to allow the 
continuous polypeptide to reach the active site. 
However, most docking methods use a rigid pro- 
tein target, and explicit modeling of the opening 




and closing of the flaps is not performed: thus, the 
ligand must "thread" its way into the active site. 
The cyclic urea HTV-protease inhibitor, XK-263, 
has 10 rota table bonds, excluding the cyclic urea's 
flexibility. All three search methods found solu- 
tions near to the crystal structure 75 : interestingly, 

the lowest docking energy found by SA was 

— 11.77 kcal mol"" 1 and had an rmsd of 1.15 A 
from the crystal structure, whereas GA and LGA 
found much lower energies but were still near to 
the active site, having crystallographic rmsd val- 
ues of 0,82 A and 0.76 A, respectively. The lowest 
docking energy found overall was —21.41 kcal 
mol" 1 , and was found using GA, although that 
found by LGA was practically the same. The pre- 
dicted binding free energy, AGp,. ed , of the lowest 

energy structure was —16,85 kcal mol" 1 using 
LGA, whereas the observed value, AG obfl/ was 

— 12.96 kcal rnol - " 1 . The larger discrepancy be- 
tween the predicted and observed values may be 
due to the entropic contributions of protein aide 
chain and flap conformational rearrangerruexits, or 
may be due to other low-energy conformational 
states of the cyclic urea moiety of XK-263, which 
are neglected in our calculations. 

Influenza Hemagglutinin/ Sialic Acid (41img) 

The recognition of sialic add by influenza 
hernagglutrnLn is chiefly mediated through hydro- 
gen bonding: sialic acid has five hydroxyls, three 
in the glycerol group, one cafboxylate, a cyclic 
ether oxygen, and an acetarrudo group, with a total 
of 11 rotatable acyclic bonds. We used the crystal 
structure of Weie et aL/ 6 although the low resolu- 
tion meant that the overall coordinate error was 
approximately 0-35-0.4O A, which, in itself, pre- 
sents a potential challenge in the docking tests. We 
modeled an isopr opylated derivative of sialic acid 
to mimic part of an adjacent six-membered ring 
that would normally be present in this complex, 
but was not seen due to disorder; this introduced 
an extra rotatable bond, giving a total of 11 tor- 
sions. Furthermore, in these tests, we used the 
crystal conformation of the six-membered ring, 
although normally we would use several of the 
lowest energy conformations of the ring system 
and dock these separately. 

This was one of. two cases where simulated 
annealing failed to* find a decking that was near 
the crystal structure: the lowest energy structure 
found had an rmsd of 3.77 A from the crystallo- 
graphic structure, and the docking with the lowest 
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crystallographic rmsd was 2-36 A- The mean en- 
ergy of all 10 SA dockings was very high (6.99 x 
10* kcal mol" 1 ); only 4 of the 10 SA dockings 
found negative energies. Both GA and LGA, how- 
ever, succeeded in finding ccirtfcirmatians near 
(<, 13 A rmsd) the crystal conformation. The low- 
est energy found was "by LGA, and was —7,72 kcal 
mol" 1 . This structure had a crystallographic rmsd 
of 1*14 A, and had a predicted binding free energy, 
AGpied' of — 6*75 kcal mol" 1 ; the observed binding 
free energy, AG^, for the sialic acid -hem agg- 
lutinin complex was —3.48 kcal mol~ l . The differ- 
ence in predicted and observed binding free ener- 
gies may be due to the structural differences 
between the is op ropy] ated derivative that was 
docked and sialic acid itself. 

Dihydrofolate Reductase/ Methotrexate 
(4tfgr) 

Methotrexate is an antirnefcabolite that attacks 
proliferating tissue selectively induces remissions 
in certain acute leukemias 77 ; however, dangerous 
side effects of methotrexate in normal cells con- 
tinue to make DHFR an important target in com- 
puter-assisted anticancer drug design. 78 We used 
the crystal structure of E- coli dihydrofolate reduc- 
tase complexed with methotrexate 79 to investigate 
a more challenging docking problem. We assumed 
that waters 603, 604> and 639 , which mediate hy- 
drogen bonding between the inhibitor and the 
protein, were conserved biowaters, and included 
them in the protein structure in our grid calcula- 
tions- Ideally, these should be predicted, and re- 
cently a method based on a ^-nearest-neighbors 
classifier and a genetic algorithm called Consolv 
was reported to do just this. 50 

This is erne of the two test cases where simu- 
lated annealing failed: the lowest energy structure 

that it found had an rmsd of 4.83 A from the 
crystal structure. This could be because the final 
docked c conformation in simulated annealing is 
arrived at after a series of continuous steps, and if 
the route to the active site Is blocked, the docking 
will tend to fail before the ligand reaches the 
active site. Note that, in the case of the camphor— 
cytochrome P-450^ docking, the random initial- 
ization loop was able to find initial states that were 
inside the binding pocket, but in this case the 
dockings failed to start near the active site. 

The lowest energy found was —16.98 kcal 
mol' 1 , and was found using LGA: this structure 
had an rmsd from the crystal structure of 1,03 A. 
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The predicted binding free energy, AG^^, of the 
lowest docked energy structure was — 14.39 kcal 
mol"* 1 using LGA, whereas the observed value, 
AG Dbfi/ was — 13.22 kcal mol*" 1 . This finding was 
within the estimated error of the model. 

JUDGING SEARCH METHODS 

To evaluate the new search methods, and to 
compare them with the earlier search method of 
simulated annealing, we addressed the following 
questions: Which search method is most efficient! 
That is, which finds the lowest energy in a given 
number of energy evaluations? Which search 
method is most reliable? That is, which, method 
finds the most conformations similar to that of the 
lowest energy? Finally, which, search method is 
most successful? That is, which finds the crystallo- 
graphic conformation most often after a given 
number of dockings? Furthermore, because these 
comparisons were carried out using the new, em- 
pirical free energy force field, these tests also rep- 
resent an evaluation of the force field itself, and, if 
the global minimum of the force field is unable to 
reproduce observed crystallographic structures, its 
usefulness will be limited. Because it is very diffi- 
cult to determine the global minimum of such a 
complex function, we cannot answer this question 
definitively; however, we can report the lowest 
energy found by any of the methods and its struc- 
tural similarity to that of the crystal structure. 

If we calculate statistics across all seven pro- 
tein-ligand test systems for each search method, 
we obtain a quantitative estimate of relative per- 



formance of each search method (see Table VHD. 
If, in each of the seven test systems, we assume 
that the lowest docked energy found by any 
method is the effective global rninimum energy, 
and then calculate the difference between this en- 
ergy and all of the docked energies found by each 
search- method, we can then calculate the mean 
and standard deviation of this difference energy 
for each search method- Ideally, the mean and 
standard deviation of this value would be zero. 
The mean of this difference energy was lowest for 
LGA (0.40 kcal mor 1 ), followed by GA (3,41 kcal 
moT 1 ), and finally SA (2.62 X 10 s kcal mol" 1 ): 
the very high mean difference energy for SA is 
indicative of the cases in which this method failed 
to escape a local minimum, where the ligand 
was partially or wholly trapped within the pro- 
tein- Hence, in answer to the first question, the 
Lamaxckian genetic algorithm/ LGA, is the most 
efficient search method- 
la terms of how often the structure with lowest 
energy was found, LGA performed best; the mean 
of the number of docked structures in rank 1 was 
78% for LGA, 40% for GA, and 24% -for SA- The- 
mean of the number of clusters found was lowest' 
for LGA (2-29), followed by GA (7,00), and finally 
SA (8-43). Hence, the most reliable search method 
was LGA. 

In comparing the relative success of each search 
method in reproducing the crystallographic struc- 
ture, considering the crystallo graphic rmsd across 
all 10 dockings in each of the 7 test systems, the 
mean rmsd was lowest for LGA (0.88 A, standard 



TABLE VIII . ' : 

Statistical Compart&ort of Three Search Methods In AirraDooc 3.0 Across all Bevm Test Systems. 1 



Energy (kcal mol 1 ) and rmsd (A) 













rmsd 












Number 


Number 


Difference from 


of 






Number of 


Search 




of 


in 


effective global 


lowest 


Mean 


Mean 


energy 


method 


Statistic 


clusters 


rank 1 


minimum energy 


energy 


energy 


rmsd 


evaluations 


SA 


Mean 


8.43 


2.43 


2.62 X 10 5 


1.B5 


2.61 x 10 s 


3.63 


175 x 10 6 


SD 


2.70 


2,44 


1.40 X 10 5 


1.74 


4.60 X 10 4 


2.61 


4.15 X 10 s 


GA 


Mean 


7.00 


4.00 


3.41 


0.82 


-7.64 


3.08 


1.35 X 10 a 




SD 


3.08 


3.06 


5.31 


0.25 


2.59 


1.32 


0.00 


LGA 


Mean 


£29 


7.88 


0.40 


0.66 


-10,47 


0.68 


1.50 X 10 6 


SD 


1.80 


2.91 


2.62 


0.24 


5-47 


0.25 


0-00. 



"The search methods are simulated annealing (SA), genetic algorithm (GA), and lamsrrfqan genetic aigorithm (LQa). The mean 
and standard deviation (SD) tor each criterion Is shown. The effective global minimum energy for each of The seven test systems is 
the lowest docked energy found by any method for that test system. For each of the 1 0 dockings, the difference between the final 
docked energy and this effective global minimum energy was calculated; the mean and srtandaid deviation was calculated across 
ail 7 test systems, which was repeated for each search method. 
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TABLE DC 

Resorts of Crow-ValldaUon of Free Energy 
Function Using Local Search on 20 HiV-1 
Pratea&a-InhlbHo* Complexea. 



f^l 

PDB 
code 


Experimental 

^binding 

ft is A 

(kcal/mol) 


Calculated 
(kcal / moO 


i nvs 


— 14.04 


— 10.95 


^ Ut lie 


- 13,79 


— 11. 60 


i nvi 


— 13.74 


— 12.38 


/nvp 


— lo.l I 


— 1 2„ 1 8 


1nps 


-12.57 


-11.80 


1hpv 


-12.57 


-8.24 


4phv 


-1231 


-14.36 


Ihef 


-12.27 


-B.52L 


1h/v 


- 1 2.27 


-13.02 


Ihvl 


-12.27 


-10.35 


Bftvp 


-12.27 


-9.36 


laaq 


-11.62 


-9.68 


Jhtg 


-11.58 


-13.13 


9hvp 


-11.38. 


- 1 0.54 


Ihih 


- 1 0.97 


-11.43 


1heg 


-10.56 


-8,60 


1sbg 


-10.56 


-10.35 


1htf 


-9.31 


-8-21 


Ihbv 


-8.68 


-9.75 


1hte 


-7.69 


-7.28 



deviation 0,25 A), followed by GA (3.06 A, stan- 
dard deviation 132 A], and finally SA C3.63 A, 
standard deviation. 2,61 Al These average results 
indicate that/ of the three search methods, LGA 
will find the crystal! ographic structure most ofteru 
Trvus, the answer to the last question, "Which 
method is most successful?/' is LGA. 

In two different cases, 4hmgand 4dfr > the simu- 
lated annealing method failed to reproduce the 
corresponding crystal structure, although it suc- 
ceeded with lhvr (see Fig. 4). This is important 
because methotrexate has 7 rota table bonds, and 
would be expected to be solvable using our ruie- 
of-thumb that SA succeeds in problems with 8 
torsions or less; however, the HLV-1 protease in- 
hibitor XK-263, has 10 rotatable bonds, and was 
successfully docked using SA. Thus, the degree of 
difficulty of a docking problem is not as simple as 
how many rotatable bonds there are; other factors, 
such as the nature of the energy landscape, clearly 
play an important role. 

It could be said that the crystallographic rmsd 
of the lowest energy structure found by any of the 
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search methods is an estimate of the quality of the 
force field, although this is complicated by the fact 
that the search method itself must determine a 
docking near to the global rrunimurrv, an unknown 
state. We can calculate the energy of the ligand in 
the crystal structure using the new force field (see 
Table H), which we assume to be near the global 
rrunirnurn, but, unfortunately, the crystal structure 
may contain frustrations and bad contacts. This 
appears to be the case in 2mcp / where a close 
contact between C2 and Ol causes a positive total 
energy to be calculated for the crystal structure. In 
all cases, the lowest energy found, ccmsidering all 
the search methods, was lower than that of the 
corresponding crystal structure. 

The crystallographic rmsd of the lowest energy 
(found by any search method) for each of the 
protein-ligand test systems were all within 1.14 A, 
or less, of the crystal structure. This suggests that 
the force field's global rninimum in each of the 
protein-ligand cases was near to the crystal struc- 
ture, if we accept the assumption that the crystal 
structure was neat to or at the global mbiimum, 
and that the lowest energy found was near to the- 
global minimum. In some cases, dockings were 
found that had lower crystallographic rmsd values 
bnt'sMghrly higher energies than the lowest energy 
found. All of the lowest crystallographic rmsd 
values were 0.89 A. or less, indicating that low-en- 
ergy structures found by the force field were very 
similar to the corresponding crystal structure- 




AutoBoqc is a software package of general 
applicability for automated docking of small 
molecules, such as peptides, enzyme inhibitors, 
and drugs, to macromolecules, such as proteins, 
enzymes, antibodies, DNA, and RNA. New search 
methods have been introduced and tested here, 
using a new, empirical binding free energy func- 
tion for calculating Hgand-receptor binding af run- 
ties - 

We have shown that, of the three search, meth- 
ods tested in AUTODoac (simulated armealing, 
genetic algorithm, and Lamarckian genetic algo- 
rithm), the most efficient, reliable, and successful 
is the Lamarckian genetic algorithm LGA. We de- 
fined efficiency of search in terms of lowest energy 
found in a given number of energy evaluations; 
reliability in terms of reprc^utibility of finding the 
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lowest energy structure in independent dockings, 
ag measured by the number of conformations in 
the top ranked duster; and success in terms of 
reproducing the known crystal structure. Simu- 
lated annealing failed to reproduce the crystal 
structures for the influenza hemagglutin-sialic acid 
complex (4hmg) ,and the dihydrofolate reductase- 
methotrexate complex (4dfrX However, both the 
genetic algorithm and the Lamarckian genetic al- 
gorithm methods succeeded. Thus, the introduc- 
tion of the LGA search method extends the power 
and applicability of AutoDock to docking prob- 
lems with more degrees of freedom than could be 
handled by earlier versions. 

The predicted binding affinities of the lowest 
energy docked conformations, using the LGA 
method and the new empirical free energy func- 
tion, were within the standard residual error of the 
force field in four of the seven cases (3ptb, 2cpp, 
2mcp, and 4dfr), and reasonably close in two other 
cases (Uivx and 4hmg). The large discrepancy be- 
tween the predicted and the observed binding 
affinity of biotin for streptavidin (lstp), even 
though the crystal structure was successfully re- 
produced, may be due to the large free energy 
change that accompanies conformational changes 
in the protein upon binding, in particular the sur- 
face loops. This remains a limitation of the method, 
because protein motion is not modeled and suc- 
cessfully predicting such large-scale protein con- 
formational changes is difficult. The AutoDock 
method works well when there is little change 
between the apo and ligand-bound forms of the 
protein, even if the protein undergoes significant 
conformational changes during binding, 

AutoDock predicts the binding affinity using 
one conformation of the ligand— protein complex, 
A new class of models for predicting receptor-lig- 
and binding affinities has been reported recently 
that considers not just the lowest energy state of 
the complex, but the predominant states of the bind- 
ing molecules. 81 These approaches are grounded in 
statistical thermodynamics, and combine a modest 
set of degrees of freedom with aggressive confor- 
mational sampling to identify the low-energy con- 
formations of the complex and the free molecules. 
AutoDock version 3.0 currently performs exten- 
sive conformational sampling, information that 
could be incorporated into the calculation of the 
binding affinity. We are studying how the search 
methods can be modified such that statistical ther- 
modynamics calculations can be performed while 



the docking proceeds, to improve the calculation 
of the binding affinity. 



Availability 

More information about AutoDooc and how to 
obtain it can be found on the World Wide Web 

at: http;//www.scrlpps.edu/pub/olson-web / 
doc/autodock. 
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EXHIBIT C 



MSNWDTKFLKKGYTFDDVLLIPAESHVLPNEVDLKTKLADNLTL 

NIPIITAAMDTVTGSKMAIAIARAGGLGVIHKNMSITEQAEEVRKVKRSENGVIIDPF 
FLTPEHKVSEAEELMQRYRISGVPIVETLANRKLVGIITNRDMRFISDYNAPISEHMT 
SEHLVTAAVGTDLETAERILHEHRIEKLPLVDNSGRLSGLITIKDIEKVIEFPHAAKD 
E FGRLLVAAAVGVT S DT FERAE AL FE AGADAI VI DT AHGH S AGVLRK I AE I RAHFPNR 
TLIAGNIATAEGARALYDAGVDVVKVGIGPGSICTTRVVAGVGVPQVTAIYDAAAVAR 
EYGKTIIADGGIKYSGDIVKALAAGGNAVMLGSMFAGTDEAPGETEI YQGRKFKTYRG 
VGAGDIQELHENAQFVEMSGAGLIESHPHDVQITNEAPNYSV'' 



